Abstract

We consider the problem of estimating a sparse linear regression vector $\beta^*$ under a Gaussian
noise model, for the purpose of both prediction and model selection. We assume that prior
knowledge is available on the sparsity pattern, namely the set of variables is partitioned into
prescribed groups, only few of which are relevant in the estimation process. This group sparsity
assumption suggests that we consider the Group Lasso method as a means to estimate $\beta^*$. We
establish oracle inequalities for the prediction and $\ell_2$ estimation errors of this estimator. These
bounds hold under a restricted eigenvalue condition on the design matrix. Under a stronger
coherence condition, we derive bounds for the estimation error for mixed $(2,p)$-norms with
$1 \le p \le \infty$. When $p = \infty$, this result implies that a threshold version of the Group Lasso
estimator selects the sparsity pattern of $\beta^*$ with high probability. Next, we prove that the rate
of convergence of our upper bounds is optimal in a minimax sense, up to a logarithmic factor,
for all estimators over a class of group sparse vectors. Furthermore, we establish lower bounds
for the prediction and $\ell_2$ estimation errors of the usual Lasso estimator. Using this result, we
demonstrate that the Group Lasso can achieve an improvement in the prediction and estimation
properties as compared to the Lasso.

An important application of our results is provided by the problem of estimating multiple
regression equations simultaneously, or multi-task learning. In this case, our results lead to
refinements of the results in [22] and allow one to establish the quantitative advantage of the
Group Lasso over the usual Lasso in the multi-task setting. Finally, within the same setting, we
show how our results can be extended to more general noise distributions, of which we only
require the fourth moment to be finite. To obtain this extension, we establish a new maximal
moment inequality, which may be of independent interest.



1 Introduction 

Over the past few years there has been a great deal of attention on the problem of estimating a
sparse¹ regression vector $\beta^*$ from a set of linear measurements

$$ y = X\beta^* + W. \tag{1.1} $$

Here $X$ is a given $N \times K$ design matrix and $W$ is a zero mean random variable modeling the
presence of noise.

A main motivation behind sparse estimation comes from the observation that in several practical
applications the number of variables $K$ is much larger than the number $N$ of observations, but
the underlying model is known to be sparse, see [8, 12] and references therein. In this situation,
the ordinary least squares estimator is not well-defined. A more appropriate estimation method
is the $\ell_1$-norm penalized least squares method, which is commonly referred to as the Lasso. The
statistical properties of this estimator are now well understood, see, e.g., [4, 6, 7, 18, 21, 36] and
references therein. In particular, it is possible to obtain oracle inequalities on the estimation and
prediction errors, which are meaningful even in the regime $K \gg N$.

In this paper, we study the above estimation problem under additional structural conditions on
the sparsity pattern of the regression vector $\beta^*$. Specifically, we assume that the set of variables can
be partitioned into a number of groups, only few of which are relevant in the estimation process.
In other words, we require not only that many components of the vector $\beta^*$ are zero, but also
that many of the a priori known subsets of components are entirely equal to zero. This structured sparsity
assumption suggests that we consider the Group Lasso method [39] as a means to estimate $\beta^*$ (see
equation (2.2) below). It is based on regularization with a mixed $(2,1)$-norm, namely the sum,
over the set of groups, of the Euclidean norm of the regression coefficients restricted to each of the
groups. This estimator has received significant recent attention, see
[3, 10, 16, 17, 19, 24, 25, 26, 28, 31] and references therein. Our principal goal is to clarify the
advantage of this more stringent group sparsity assumption in the estimation process over the usual
sparsity assumption. For this purpose, we shall address the issues of bounding the prediction error
and the estimation error, as well as estimating the sparsity pattern. The main difference from most
of the previous work is that we obtain not only the upper bounds but also the corresponding lower
bounds, and thus establish optimal rates of estimation and prediction under group sparsity.

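A main motivation for us to consider the group sparsity assumption is the practically important
problem of simultaneous estimation of the coefficients of multiple regression equations

$$ \begin{aligned} y_1 &= X_1\beta_1^* + W_1, \\ y_2 &= X_2\beta_2^* + W_2, \\ &\;\;\vdots \\ y_T &= X_T\beta_T^* + W_T. \end{aligned} \tag{1.2} $$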
Here $X_1, \ldots, X_T$ are prescribed $n \times M$ design matrices, $\beta_1^*, \ldots, \beta_T^* \in \mathbb{R}^M$ are the unknown
regression vectors which we wish to estimate, $y_1, \ldots, y_T$ are $n$-dimensional vectors of observations
and $W_1, \ldots, W_T$ are i.i.d. zero mean random noise vectors. Examples in which this estimation
problem is relevant range from multi-task learning [2, 23, 28] and conjoint analysis [14, 20] to
longitudinal data analysis [11] as well as the analysis of panel data [15, 38], among others. We
briefly review these different settings in the course of the paper. In particular, multi-task learning
provides a main motivation for our study. In that setting each regression equation corresponds to a
different learning task; in addition to the requirement that $M \gg n$, we also allow for the number
of tasks $T$ to be much larger than $n$. Following [2] we assume that there are only few common
important variables which are shared by the tasks. That is, we assume that the vectors $\beta_1^*, \ldots, \beta_T^*$
are not only sparse but also have their sparsity patterns included in the same set of small cardinality.
This group sparsity assumption induces a relationship between the responses and, as we shall see,
can be used to improve estimation.

¹ The phrase "$\beta^*$ is sparse" means that most of the components of this vector are equal to zero.

The model (1.2) can be reformulated as a single regression problem of the form (1.1) by setting
$K = MT$, $N = nT$, identifying the vector $\beta$ with the concatenation of the vectors $\beta_1, \ldots, \beta_T$ and
choosing $X$ to be a block diagonal matrix, whose blocks are formed by the matrices $X_1, \ldots, X_T$,
in order. In this way the above sparsity assumption on the vectors $\beta_t^*$ translates into a group sparsity
assumption on the vector $\beta^*$, where each group is associated with one of the variables. That is,
each group contains the same regression component across the different equations (1.2). Hence
the results developed in this paper for the Group Lasso apply to the multi-task learning problem as
a special case.

1.1 Outline of the main results 

We are now ready to summarize the main contributions of this paper. 

• We first establish bounds for the prediction and $\ell_2$ estimation errors for the general Group
Lasso setting, see Theorem 3.1. In particular, we include a "slow rate" bound, which holds
under no assumption on the design matrix $X$. We then apply the theorem to the specific
multi-task setting, leading to some refinements of the results in [22]. Specifically, we demonstrate
that as the number of tasks $T$ increases the dependence of the bound on the number of
variables $M$ disappears, provided that $M$ grows at a rate slower than $\exp(T)$.

• We extend previous results on the selection of the sparsity pattern for the usual Lasso to the
Group Lasso case, see Theorem 5.1. This analysis also allows us to establish the rates of
convergence of the estimators for mixed $(2,p)$-norms with $1 \le p \le \infty$ (cf. Corollary 5.1).

• We show that the rates of convergence in the above upper bounds for the prediction and
$(2,p)$-norm estimation errors are optimal in a minimax sense (up to a logarithmic factor) for
all estimators over a class of group sparse vectors $\beta^*$, see Theorem 6.1.

• We prove that the Group Lasso can achieve an improvement in the prediction and estimation
properties as compared to the usual Lasso. For this purpose, we establish lower bounds for
the prediction and $\ell_2$ estimation errors of the Lasso estimator (cf. Theorem 7.1) and show
that, in some important cases, they are greater than the corresponding upper bounds for the
Group Lasso, under the same model assumptions. In particular, we clarify the advantage of
the Group Lasso over the Lasso in the multi-task learning setting.

• Finally, we present an extension of the multi-task learning analysis to more general noise
distributions having only bounded fourth moment, see Theorems 8.1 and 8.2. This extension
is not straightforward and needs a new tool, the maximal moment inequality of Lemma 9.1,
which may be of independent interest.

1.2 Previous work 

Our results build upon recently developed ideas in the area of compressed sensing and sparse
estimation, see, e.g., [4, 8, 12, 18] and references therein. In particular, it has been shown by
different authors, under different conditions on the design matrix, that the Lasso satisfies sparsity
oracle inequalities, see [5, 6, 7, 21, 18, 36, 41] and references therein. Closest to our study is
the paper [4], which relies upon a Restricted Eigenvalue (RE) assumption, as well as [21], which
considered the problem of selection of the sparsity pattern. Our techniques of proof build upon and
extend those in these papers.

Several papers analyzing statistical properties of the Group Lasso estimator appeared quite
recently [3, 10, 16, 19, 24, 25, 26, 31]. Most of them are focused on the Group Lasso for additive
models [16, 19, 25, 31] or generalized linear models [24]. A special choice of groups is studied
in [10]. Discussion of the Group Lasso in a relatively general setting is given by Bach [3] and
Nardi and Rinaldo [26]. Bach [3] assumes that the predictors (rows of matrix $X$) are random
with a positive definite covariance matrix and proves results on consistent selection of the sparsity
pattern $J(\beta^*)$ when the dimension of the model ($K$ in our case) is fixed and $N \to \infty$. Nardi
and Rinaldo [26] address the issue of sparsity oracle inequalities in the spirit of [4] under the
simplifying assumption that all the Gram matrices $\Psi_j$ (see the definition below) are proportional
to the identity matrix. However, the rates in their bounds are not precise enough (see comments in
[22]) and they do not demonstrate advantages of the Group Lasso as compared to the usual Lasso.
Obozinski et al. [28] consider the model (1.2) where all the matrices $X_t$ are the same and all their
rows are independent Gaussian random vectors with the same covariance matrix. They show that
the resulting estimator achieves consistent selection of the sparsity pattern and that there may be
some improvement with respect to the usual Lasso. Note that the Gaussian $X_t$ is a rather particular
example, and Obozinski et al. [28] focused on consistent selection, rather than exploring
whether there is some improvement in the prediction and estimation properties as compared to
the usual Lasso. The latter issue has been addressed in our work [22] and in the parallel work of
Huang and Zhang [17]. These papers considered only heuristic comparisons of the two estimators,
i.e., those based on the upper bounds. Also, the settings treated there did not cover the problem in
full generality. Huang and Zhang [17] considered the general Group Lasso setting but obtained
only bounds for prediction and $\ell_2$ estimation errors, while [22] focused only on the multi-task
setting, though additionally with bounds for more general mixed $(2,p)$-norm estimation errors and
consistent pattern selection properties.

1.3 Plan of the paper 

This paper is organized as follows. In Section 2 we define the Group Lasso estimator and describe
its application to the multi-task learning problem. In Sections 3 and 4 we study the oracle properties
of this estimator in the case of Gaussian noise, presenting upper bounds on the prediction and
estimation errors. In Section 5, under a stronger condition on the design matrices, we describe
a simple modification of our method and show that it selects the correct sparsity pattern with
overwhelming probability. Next, in Section 6 we show that the rates of convergence in our upper
bounds on prediction and $(2,p)$-norm estimation errors with $1 \le p \le \infty$ are optimal in a minimax
sense, up to a logarithmic factor. In Section 7 we provide a lower bound for the Lasso estimator,
which allows us to quantify the advantage of the Group Lasso over the Lasso under the group
sparsity assumption. In Section 8 we discuss an extension of our results for multi-task learning to
more general noise distributions. Finally, Section 9 presents a new maximal moment inequality
(an extension of Nemirovski's inequality from the second to arbitrary moments), which is needed
in the proofs of Section 8.

2 Method 

In this section, we introduce the notation and describe the estimation method, which we analyze in
the paper. We consider the linear regression model

$$ y = X\beta^* + W, \tag{2.1} $$

where $\beta^* \in \mathbb{R}^K$ is the vector of regression coefficients, $X$ is an $N \times K$ design matrix, $y \in \mathbb{R}^N$ is
the response vector and $W \in \mathbb{R}^N$ is a random noise vector which will be specified later. We also
denote by $x_1^\top, \ldots, x_N^\top$ the rows of matrix $X$. Unless otherwise specified, all vectors are meant to
be column vectors. Hereafter, for every positive integer $\ell$, we let $\mathbb{N}_\ell$ be the set of integers from 1
up to $\ell$. Throughout the paper we assume that $X$ is a deterministic matrix. However, it should
be noted that our results extend in a standard way (as discussed, e.g., in [4], [8]) to random $X$
satisfying the assumptions stated below with high probability.

We choose $M \le K$ and let the sets $G_1, \ldots, G_M$ form a prescribed partition of the index set $\mathbb{N}_K$
in $M$ sets. That is, $\mathbb{N}_K = \bigcup_{j=1}^M G_j$ and, for every $j \ne j'$, $G_j \cap G_{j'} = \emptyset$. For every $j \in \mathbb{N}_M$, we
let $K_j = |G_j|$ be the cardinality of $G_j$ and denote by $X_{G_j}$ the $N \times K_j$ sub-matrix of $X$ formed by
the columns indexed by $G_j$. We also use the notation $\Psi = X^\top X/N$ and $\Psi_j = X_{G_j}^\top X_{G_j}/N$ for the
normalized Gram matrices of $X$ and $X_{G_j}$, respectively.

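For every $\beta \in \mathbb{R}^K$ we introduce the notation $\beta^j = (\beta_k : k \in G_j)$ and, for every $1 \le p < \infty$,
we define the mixed $(2,p)$-norm of $\beta$ as

$$ \|\beta\|_{2,p} = \Biggl( \sum_{j=1}^M \Bigl( \sum_{k \in G_j} \beta_k^2 \Bigr)^{p/2} \Biggr)^{1/p} = \Biggl( \sum_{j=1}^M \|\beta^j\|^p \Biggr)^{1/p} $$

and the $(2,\infty)$-norm of $\beta$ as

$$ \|\beta\|_{2,\infty} = \max_{1 \le j \le M} \|\beta^j\|, $$

where $\|\cdot\|$ is the standard Euclidean norm.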

If $J \subseteq \mathbb{N}_M$ we let $\beta_J$ be the vector $(\beta^j I\{j \in J\} : j \in \mathbb{N}_M)$, where $I\{\cdot\}$ denotes the indicator
function. Finally we set $J(\beta) = \{j : \beta^j \ne 0,\ j \in \mathbb{N}_M\}$ and $M(\beta) = |J(\beta)|$, where $|J|$ denotes
the cardinality of a set $J \subseteq \{1, \ldots, M\}$. The set $J(\beta)$ contains the indices of the relevant groups
and $M(\beta)$ is the number of such groups. Note that when $M = K$ we have $G_j = \{j\}$,
$j \in \mathbb{N}_K$, and $\|\beta\|_{2,p} = \|\beta\|_p$, where $\|\beta\|_p$ is the $\ell_p$ norm of $\beta$.
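The mixed norms defined above are simply the $\ell_p$ norm of the vector of per-group Euclidean norms. The following minimal Python sketch illustrates this; the function name `mixed_norm`, the zero-based index lists and the example partition are illustrative assumptions and do not come from the paper.

```python
import math

def mixed_norm(beta, groups, p):
    """Mixed (2, p)-norm: the l_p norm of the per-group Euclidean norms.

    beta   -- list of coefficients (zero-based indexing, an assumption of this sketch)
    groups -- list of index lists forming a partition of range(len(beta))
    p      -- a number >= 1, or math.inf for the (2, infinity)-norm
    """
    group_norms = [math.sqrt(sum(beta[k] ** 2 for k in g)) for g in groups]
    if math.isinf(p):
        return max(group_norms)
    return sum(r ** p for r in group_norms) ** (1.0 / p)

beta = [3.0, 4.0, 0.0, 0.0, 1.0]
groups = [[0, 1], [2, 3], [4]]                 # hypothetical partition G_1, G_2, G_3
norm_21 = mixed_norm(beta, groups, 1)          # sum of the group norms
norm_2inf = mixed_norm(beta, groups, math.inf) # largest group norm

# With singleton groups (M = K) the mixed norm reduces to the usual l_p norm.
singletons = [[k] for k in range(len(beta))]
norm_l1 = mixed_norm(beta, singletons, 1)
```

In this example only two of the three groups are non-zero, so $M(\beta) = 2$ even though three individual coefficients are non-zero.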

The main assumption we make on $\beta^*$ is that it is group sparse, which means that $M(\beta^*)$ is
much smaller than $M$.

Our main goal is to estimate the vector $\beta^*$ as well as its sparsity pattern $J(\beta^*)$ from $y$. To this
end, we consider the Group Lasso estimator. It is defined to be a solution $\hat\beta$ of the optimization
problem

$$ \min\Bigl\{ \frac{1}{N}\|X\beta - y\|^2 + 2\sum_{j=1}^M \lambda_j \|\beta^j\| \;:\; \beta \in \mathbb{R}^K \Bigr\}, \tag{2.2} $$

where $\lambda_1, \ldots, \lambda_M$ are positive parameters, which we shall specify later.

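In order to study the statistical properties of this estimator, it is useful to present the optimality
conditions for a solution of the problem (2.2). Since the objective function in (2.2) is convex, $\hat\beta$ is
a solution of (2.2) if and only if $0$ (the $K$-dimensional zero vector) belongs to the subdifferential
of the objective function. In turn, this condition is equivalent to the requirement that

$$ -\nabla\Bigl( \frac{1}{N}\|X\hat\beta - y\|^2 \Bigr) \in 2\,\partial\Bigl( \sum_{j=1}^M \lambda_j \|\hat\beta^j\| \Bigr), $$

where $\partial$ denotes the subdifferential (see, for example, [5] for more information on convex analysis).
Note that

$$ \partial\Bigl( \sum_{j=1}^M \lambda_j \|\beta^j\| \Bigr) = \Bigl\{ \theta \in \mathbb{R}^K : \theta^j = \lambda_j \frac{\beta^j}{\|\beta^j\|} \text{ if } \beta^j \ne 0, \text{ and } \|\theta^j\| \le \lambda_j \text{ if } \beta^j = 0,\ j \in \mathbb{N}_M \Bigr\}. $$

Thus, $\hat\beta$ is a solution of (2.2) if and only if

$$ \frac{1}{N}\bigl(X^\top(y - X\hat\beta)\bigr)^j = \lambda_j \frac{\hat\beta^j}{\|\hat\beta^j\|}, \quad \text{if } \hat\beta^j \ne 0, \tag{2.3} $$

$$ \frac{1}{N}\bigl\|\bigl(X^\top(y - X\hat\beta)\bigr)^j\bigr\| \le \lambda_j, \quad \text{if } \hat\beta^j = 0. \tag{2.4} $$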
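To make the optimality conditions concrete, consider the special case $X^\top X/N = I$, which is an assumption made only for this illustration and is not required anywhere in the paper. In that case the objective in (2.2) separates across groups and, writing $z = X^\top y/N$, its minimizer is the group soft-thresholding of $z$, while $\frac{1}{N}X^\top(y - X\hat\beta)$ reduces to $z - \hat\beta$. The sketch below (function names are hypothetical) computes this solution and checks (2.3) and (2.4).

```python
import math

def group_soft_threshold(z, groups, lams):
    """Solution of (2.2) under the assumption X^T X / N = I, with z = X^T y / N."""
    beta = [0.0] * len(z)
    for g, lam in zip(groups, lams):
        norm_z = math.sqrt(sum(z[k] ** 2 for k in g))
        if norm_z > lam:                       # the group survives and is shrunk
            scale = 1.0 - lam / norm_z
            for k in g:
                beta[k] = scale * z[k]
    return beta                                # groups with ||z^j|| <= lam stay at zero

def kkt_holds(z, beta, groups, lams, tol=1e-12):
    """Check (2.3)-(2.4); here (1/N) X^T (y - X beta) equals z - beta."""
    for g, lam in zip(groups, lams):
        residual = [z[k] - beta[k] for k in g]
        norm_b = math.sqrt(sum(beta[k] ** 2 for k in g))
        if norm_b > 0:                         # condition (2.3)
            for r_k, k in zip(residual, g):
                if abs(r_k - lam * beta[k] / norm_b) > tol:
                    return False
        elif math.sqrt(sum(r ** 2 for r in residual)) > lam + tol:
            return False                       # condition (2.4) violated
    return True

z = [3.0, 4.0, 0.3, -0.4, 2.0]
groups = [[0, 1], [2, 3], [4]]                 # hypothetical partition
lams = [1.0, 1.0, 1.0]
beta_hat = group_soft_threshold(z, groups, lams)
```

Here the second group has $\|z^j\| = 0.5 \le \lambda_j$ and is set to zero as a whole, which is the mechanism by which the penalty in (2.2) selects groups rather than individual coefficients.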

2.1 Simultaneous estimation of multiple regression equations and multi-task learning

As an application of the above ideas we consider the problem of estimating multiple linear regression
equations simultaneously. More precisely, we consider multiple Gaussian regression models,

$$ \begin{aligned} y_1 &= X_1\beta_1^* + W_1, \\ y_2 &= X_2\beta_2^* + W_2, \\ &\;\;\vdots \\ y_T &= X_T\beta_T^* + W_T, \end{aligned} \tag{2.5} $$

where, for each $t \in \mathbb{N}_T$, we let $X_t$ be a prescribed $n \times M$ design matrix, $\beta_t^* \in \mathbb{R}^M$ the unknown
vector of regression coefficients and $y_t$ an $n$-dimensional vector of observations. We assume that
$W_1, \ldots, W_T$ are i.i.d. zero mean random vectors.

We study this problem under the assumption that the sparsity patterns of the vectors $\beta_t^*$ are, for any
$t$, contained in the same set of small cardinality $s$. In other words, the response variable associated
with each equation in (2.5) depends only on some members of a small subset of the corresponding
predictor variables, which is preserved across the different equations. We consider as our estimator
a solution of the optimization problem

$$ \min\Bigl\{ \frac{1}{T}\sum_{t=1}^T \frac{1}{n}\|X_t\beta_t - y_t\|^2 + 2\lambda\sum_{j=1}^M \Bigl( \sum_{t=1}^T \beta_{tj}^2 \Bigr)^{1/2} \;:\; \beta_1, \ldots, \beta_T \in \mathbb{R}^M \Bigr\} \tag{2.6} $$

with some tuning parameter $\lambda > 0$. As we have already mentioned in the introduction, this estimator
is an instance of the Group Lasso estimator described above. Indeed, set $K = MT$, $N = nT$,
let $\beta \in \mathbb{R}^K$ be the vector obtained by stacking the vectors $\beta_1, \ldots, \beta_T$ and let $y$ and $W$ be the random
vectors formed by stacking the vectors $y_1, \ldots, y_T$ and the vectors $W_1, \ldots, W_T$, respectively.
We identify each row index of $X$ with a double index $(t,i) \in \mathbb{N}_T \times \mathbb{N}_n$ and each column index
with $(t,j) \in \mathbb{N}_T \times \mathbb{N}_M$. In this special case the matrix $X$ is block diagonal and its $t$-th block is
formed by the $n \times M$ matrix $X_t$ corresponding to "task $t$". Moreover, the groups are defined as
$G_j = \{(t,j) : t \in \mathbb{N}_T\}$ and the parameters $\lambda_j$ in (2.2) are all set equal to a common value $\lambda$.
Within this setting, we see that (2.6) is a special case of (2.2).
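The stacking just described can be sketched in a few lines of Python. This is only an illustration under our own conventions: the function name `stack_tasks`, the use of nested lists for matrices, and the zero-based flattening of the double index $(t,j)$ into column $tM + j$ are assumptions of the sketch, not notation from the paper.

```python
def stack_tasks(X_tasks, y_tasks):
    """Build the block diagonal design, stacked response and groups G_j.

    X_tasks -- list of T matrices, each given as n rows of M floats
    y_tasks -- list of T response vectors, each of length n
    Column (t, j) of the stacked design is stored at index t * M + j (zero-based).
    """
    T = len(X_tasks)
    n = len(X_tasks[0])
    M = len(X_tasks[0][0])
    X = [[0.0] * (M * T) for _ in range(n * T)]   # N x K with N = nT, K = MT
    y = []
    for t, (X_t, y_t) in enumerate(zip(X_tasks, y_tasks)):
        for i in range(n):
            for j in range(M):
                X[t * n + i][t * M + j] = X_t[i][j]   # t-th diagonal block
        y.extend(y_t)
    # Group j collects the coefficient of variable j across all T tasks.
    groups = [[t * M + j for t in range(T)] for j in range(M)]
    return X, y, groups

X_tasks = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
           [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]]   # T = 2, n = 2, M = 3
y_tasks = [[1.0, -1.0], [1.0, 1.0]]
X, y, groups = stack_tasks(X_tasks, y_tasks)
```

Each of the $M$ groups has exactly $T$ elements, so $K_j = T$ for every $j$ in this setting.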

Finally, note that the vectors $\beta^j = (\beta_{tj} : t \in \mathbb{N}_T)^\top$ are formed by the coefficients corresponding
to the $j$-th variable "across the tasks". The set $J(\beta) = \{j : \beta^j \ne 0,\ j \in \mathbb{N}_M\}$ contains the
indices of the relevant variables present in at least one of the vectors $\beta_1, \ldots, \beta_T$ and the number
$M(\beta) = |J(\beta)|$ quantifies the level of group sparsity across the tasks. The structured sparsity (or
group sparsity) assumption has the form $M(\beta^*) \le s$, where $s$ is some integer much smaller than
$M$.

Our interest in this model with group sparsity is mainly motivated by multi-task learning. Let 
us briefly discuss the multi-task setting as well as other applications, in which the problem of 
estimating multiple regression equations arises. 

Multi-task learning. In machine learning, the problem of multi-task learning has received much
attention recently, see [2] and references therein. Here each regression equation corresponds to a
different "learning task". In this context the tasks often correspond to binary classification, namely
the response variables are binary. For instance, in image detection each task $t$ is associated with
a particular type of visual object (e.g., face, car, chair, etc.), each row $x_{ti}^\top$ of the design matrix $X_t$
represents an image and $y_{ti}$ is a binary label, which, say, takes the value 1 if the image depicts the
object associated with task $t$ and the value $-1$ otherwise. In this setting the number of samples
$n$ is typically much smaller than the number of tasks $T$. A main goal of multi-task learning is to
exploit possible relationships across the tasks to aid the learning process.

Conjoint analysis. In marketing research, an important problem is the analysis of datasets concerning
the ratings of different products by different customers, with the purpose of improving
products, see, for example, [1, 20, 14] and references therein. Here the index $t \in \mathbb{N}_T$ refers to the
customers and the index $i \in \mathbb{N}_n$ refers to the different ratings provided by a customer. Products
are represented by (possibly many) categorical or continuous variables (e.g., size, brand, color,
price, etc.). The observation $y_{ti}$ is the rating of product $x_{ti}$ by the $t$-th customer. A main goal of
conjoint analysis is to find common factors which determine people's preferences for products. In
this context, the variable selection method we analyze in this paper may be useful to "visualize"
people's perception of products [1].

Seemingly unrelated regressions (SUR). In econometrics, the problem of estimating the regression
vectors $\beta_t^*$ in (2.5) is often referred to as seemingly unrelated regressions (SUR) [40] (see
also [34] and references therein). In this context, the index $i \in \mathbb{N}_n$ often refers to time and the
equations (2.5) are equivalently represented as $n$ systems of linear equations, indexed by time. The
underlying assumption in the SUR model is that the matrices $X_t$ are of rank $M$, which necessarily
requires that $n \ge M$. Here we do not make such an assumption. We cover the case $n \ll M$
and show how, under a sparsity assumption, we can reliably estimate the regression vectors. The
classical SUR model assumes that the noise variables are zero mean correlated Gaussian, with
$\mathrm{cov}(W_s, W_t) = \sigma_{st} I_{n \times n}$, $s, t \in \mathbb{N}_T$. This induces a relation between the responses that can be used
to improve estimation. In our model such a relation also exists, but it is described in a different
way. For example, we can consider that the sparsity patterns of the vectors $\beta_1^*, \ldots, \beta_T^*$ are the same.

Longitudinal and panel data. Another related context is longitudinal data analysis [11] as well
as the analysis of panel data [15, 38]. Panel data refers to a dataset which contains observations
of different phenomena observed over multiple instances of time (for example, election studies,
political economy data, etc.). The models used to analyze panel data appear to be related to the
SUR model described above, but there is a large variety of model assumptions on the structure
of the regression coefficients, see, for example, [15]. To our knowledge, however, sparsity
assumptions have not been put forward for analysis within this context.

3 Sparsity oracle inequalities 

Let $1 \le s \le M$ be an integer that gives an upper bound on the group sparsity $M(\beta^*)$ of the true
regression vector $\beta^*$. We make the following assumption.

Assumption 3.1. There exists a positive number $\kappa = \kappa(s)$ such that

$$ \min\Bigl\{ \frac{\|X\Delta\|}{\sqrt{N}\,\|\Delta_J\|} \;:\; |J| \le s,\ \Delta \in \mathbb{R}^K \setminus \{0\},\ \sum_{j \in J^c} \lambda_j\|\Delta^j\| \le 3\sum_{j \in J} \lambda_j\|\Delta^j\| \Bigr\} \ge \kappa, $$

where $J^c$ denotes the complement of the set of indices $J$.

To emphasize the dependency of Assumption 3.1 on $s$, we will sometimes refer to it as Assumption
RE($s$). This is a natural extension to our setting of the Restricted Eigenvalue assumption
for the usual Lasso and Dantzig selector from [4]. The $\ell_1$ norms are now replaced by (weighted)
mixed $(2,1)$-norms.

Several simple sufficient conditions for Assumption 3.1 in the Lasso case, i.e., when all the
groups $G_j$ have size 1, are given in [4]. Similar sufficient conditions can be stated in our more
general setting. For example, Assumption 3.1 is immediately satisfied if $X^\top X/N$ has a positive
minimal eigenvalue. More interestingly, it is enough to suppose that the matrix $X^\top X/N$ satisfies
a Restricted Isometry condition as in [8] or the coherence condition (cf. Lemma A.2 below).

To state our first result we need some more notation. For every symmetric and positive semi-definite
matrix $A$, we denote by $\mathrm{tr}(A)$, $\|A\|_{\mathrm{Fr}}$ and $|||A|||$ the trace, Frobenius and spectral norms of
$A$, respectively. If $\rho_1, \ldots, \rho_k$ are the eigenvalues of $A$, we have that $\mathrm{tr}(A) = \sum_{i=1}^k \rho_i$,
$\|A\|_{\mathrm{Fr}} = \sqrt{\sum_{i=1}^k \rho_i^2}$ and $|||A||| = \max_{1 \le i \le k} \rho_i$.

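Lemma 3.1. Consider the model (2.1), and let $M \ge 2$, $N \ge 1$. Assume that $W \in \mathbb{R}^N$ is a
random vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ Gaussian components, $\sigma^2 > 0$. For every $j \in \mathbb{N}_M$, recall that
$\Psi_j = X_{G_j}^\top X_{G_j}/N$ and choose

$$ \lambda_j \ge \frac{2\sigma}{\sqrt{N}} \sqrt{ \mathrm{tr}(\Psi_j) + 2|||\Psi_j|||\bigl(2q\log M + \sqrt{K_j\, q\log M}\bigr) }. \tag{3.1} $$

Then with probability at least $1 - 2M^{1-q}$, for any solution $\hat\beta$ of problem (2.2) and all $\beta \in \mathbb{R}^K$
we have that

$$ \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 + \sum_{j=1}^M \lambda_j\|\hat\beta^j - \beta^j\| \le \frac{1}{N}\|X(\beta - \beta^*)\|^2 + 4\sum_{j \in J(\beta)} \lambda_j \min\bigl(\|\beta^j\|, \|\hat\beta^j - \beta^j\|\bigr), \tag{3.2} $$

$$ \frac{1}{N}\bigl\|\bigl(X^\top X(\hat\beta - \beta^*)\bigr)^j\bigr\| \le \frac{3}{2}\lambda_j, \tag{3.3} $$

$$ M(\hat\beta) \le \frac{4\phi_{\max}}{\lambda_{\min}^2 N}\|X(\hat\beta - \beta^*)\|^2, \tag{3.4} $$

where $\lambda_{\min} = \min_{1 \le j \le M} \lambda_j$ and $\phi_{\max}$ is the maximum eigenvalue of the matrix $X^\top X/N$.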
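Proof. For all $\beta \in \mathbb{R}^K$, we have

$$ \frac{1}{N}\|X\hat\beta - y\|^2 + 2\sum_{j=1}^M \lambda_j\|\hat\beta^j\| \le \frac{1}{N}\|X\beta - y\|^2 + 2\sum_{j=1}^M \lambda_j\|\beta^j\|, $$

which, using $y = X\beta^* + W$, is equivalent to

$$ \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 \le \frac{1}{N}\|X(\beta - \beta^*)\|^2 + \frac{2}{N}W^\top X(\hat\beta - \beta) + 2\sum_{j=1}^M \lambda_j\bigl(\|\beta^j\| - \|\hat\beta^j\|\bigr). \tag{3.5} $$

By the Cauchy-Schwarz inequality, we have that

$$ W^\top X(\hat\beta - \beta) \le \sum_{j=1}^M \|(X^\top W)^j\|\,\|\hat\beta^j - \beta^j\|. $$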


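Consider the random event

$$ \mathcal{A} = \bigcap_{j=1}^M \mathcal{A}_j, \tag{3.6} $$

where, for every $j \in \mathbb{N}_M$,

$$ \mathcal{A}_j = \Bigl\{ \frac{1}{N}\|(X^\top W)^j\| \le \frac{\lambda_j}{2} \Bigr\}. \tag{3.7} $$

We note that

$$ P(\mathcal{A}_j) = P\Bigl( \frac{1}{N^2} W^\top X_{G_j} X_{G_j}^\top W \le \frac{\lambda_j^2}{4} \Bigr) = P\Bigl( \frac{\sum_{i=1}^N v_{j,i}(\xi_i^2 - 1)}{\sqrt{2}\,\|v_j\|} \le x_j \Bigr), $$

where $\xi_1, \ldots, \xi_N$ are i.i.d. standard Gaussian, $v_{j,1}, \ldots, v_{j,N}$ denote the eigenvalues of the matrix
$X_{G_j} X_{G_j}^\top/N$, among which the positive ones are the same as those of $\Psi_j$, and the quantity $x_j$ is
defined as

$$ x_j = \frac{\lambda_j^2 N/(4\sigma^2) - \mathrm{tr}(\Psi_j)}{\sqrt{2}\,\|\Psi_j\|_{\mathrm{Fr}}}. $$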

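We apply Lemma A.1 to upper bound the probability of the complement of the event $\mathcal{A}_j$. Specifically,
we choose $v = (v_{j,1}, \ldots, v_{j,N})$, $x = x_j$ and $m(v) = |||\Psi_j|||/\|\Psi_j\|_{\mathrm{Fr}}$ and conclude from
Lemma A.1 that

$$ P(\mathcal{A}_j^c) \le 2\exp\Bigl( -\frac{x_j^2}{2\bigl(1 + \sqrt{2}\,x_j\,|||\Psi_j|||/\|\Psi_j\|_{\mathrm{Fr}}\bigr)} \Bigr). $$

We now choose $x_j$ so that the right hand side of the above inequality is smaller than $2M^{-q}$. A
direct computation yields that

$$ x_j \ge \sqrt{2}\,\frac{|||\Psi_j|||}{\|\Psi_j\|_{\mathrm{Fr}}}\,q\log M + \sqrt{ 2\Bigl( \frac{|||\Psi_j|||}{\|\Psi_j\|_{\mathrm{Fr}}}\,q\log M \Bigr)^2 + 2q\log M }, $$

which, using the subadditivity property of the square root and the inequality $\|\Psi_j\|_{\mathrm{Fr}} \le \sqrt{K_j}\,|||\Psi_j|||$,
gives inequality (3.1). We conclude, by a union bound, under the above condition on the parameters
$\lambda_j$, that $P(\mathcal{A}^c) \le 2M^{1-q}$. Then, it follows from inequality (3.5), with probability at least
$1 - 2M^{1-q}$, that

$$ \begin{aligned} \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 + \sum_{j=1}^M \lambda_j\|\hat\beta^j - \beta^j\| &\le \frac{1}{N}\|X(\beta - \beta^*)\|^2 + 2\sum_{j=1}^M \lambda_j\bigl(\|\hat\beta^j - \beta^j\| + \|\beta^j\| - \|\hat\beta^j\|\bigr) \\ &\le \frac{1}{N}\|X(\beta - \beta^*)\|^2 + 4\sum_{j \in J(\beta)} \lambda_j \min\bigl(\|\beta^j\|, \|\hat\beta^j - \beta^j\|\bigr), \end{aligned} $$

which coincides with inequality (3.2).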
To prove (3.3), we use the inequality

$$ \frac{1}{N}\bigl\|\bigl(X^\top(y - X\hat\beta)\bigr)^j\bigr\| \le \lambda_j, \tag{3.8} $$

which follows from the optimality conditions (2.3) and (2.4). Moreover, using equation (2.1) and
the triangle inequality, we obtain that

$$ \frac{1}{N}\bigl\|\bigl(X^\top X(\hat\beta - \beta^*)\bigr)^j\bigr\| \le \frac{1}{N}\bigl\|\bigl(X^\top(X\hat\beta - y)\bigr)^j\bigr\| + \frac{1}{N}\|(X^\top W)^j\|. $$

The result then follows by combining the last inequality with inequality (3.8) and using the definition
of the event $\mathcal{A}$.

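Finally, we prove (3.4). First, observe that, on the event $\mathcal{A}$, it holds, uniformly over $j \in \mathbb{N}_M$,
that

$$ \frac{1}{N}\bigl\|\bigl(X^\top X(\hat\beta - \beta^*)\bigr)^j\bigr\| \ge \frac{\lambda_j}{2}, \quad \text{if } \hat\beta^j \ne 0. $$

This fact follows from (2.3), (2.1) and the definition of the event $\mathcal{A}$. The following chain yields the
result:

$$ \begin{aligned} M(\hat\beta) &\le \frac{4}{N^2}\sum_{j \in J(\hat\beta)} \frac{1}{\lambda_j^2}\bigl\|\bigl(X^\top X(\hat\beta - \beta^*)\bigr)^j\bigr\|^2 \\ &\le \frac{4}{\lambda_{\min}^2 N^2}\sum_{j \in J(\hat\beta)} \bigl\|\bigl(X^\top X(\hat\beta - \beta^*)\bigr)^j\bigr\|^2 \\ &\le \frac{4}{\lambda_{\min}^2 N^2}\|X^\top X(\hat\beta - \beta^*)\|^2 \le \frac{4\phi_{\max}}{\lambda_{\min}^2 N}\|X(\hat\beta - \beta^*)\|^2, \end{aligned} $$

where, in the last line we have used the fact that the eigenvalues of $X^\top X/N$ are bounded from
above by $\phi_{\max}$. ■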
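The threshold (3.1) and the probability bound of Lemma 3.1 are straightforward to evaluate once $\mathrm{tr}(\Psi_j)$ and $|||\Psi_j|||$ are known. The sketch below is a hedged illustration: the function names are hypothetical, the two matrix quantities are passed in as numbers rather than computed from a design matrix, and $q$ is treated as a free parameter, exactly as it appears in (3.1).

```python
import math

def lambda_threshold(sigma, N, trace_psi, spec_psi, K_j, M, q):
    """Right-hand side of (3.1) for one group.

    trace_psi -- tr(Psi_j); spec_psi -- spectral norm of Psi_j (both supplied by the caller)
    """
    log_term = q * math.log(M)
    inner = trace_psi + 2.0 * spec_psi * (2.0 * log_term + math.sqrt(K_j * log_term))
    return 2.0 * sigma / math.sqrt(N) * math.sqrt(inner)

def failure_probability(M, q):
    """The bound 2 * M^(1 - q) on the probability that the event of Lemma 3.1 fails."""
    return 2.0 * M ** (1.0 - q)

# Illustrative Lasso-type case: singleton groups with normalized columns,
# so that Psi_j is the scalar 1 (trace and spectral norm both equal to 1).
lam = lambda_threshold(sigma=1.0, N=100, trace_psi=1.0, spec_psi=1.0, K_j=1, M=1000, q=2.0)
prob = failure_probability(M=1000, q=2.0)
```

Increasing $q$ makes the probability bound $2M^{1-q}$ smaller at the price of a larger admissible $\lambda_j$, which in turn enlarges the right-hand sides of the oracle inequalities that follow.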

We are now ready to state the main result of this section. 

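Theorem 3.1. Consider the model (2.1) and let $M \ge 2$, $N \ge 1$. Assume that $W \in \mathbb{R}^N$ is a
random vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ Gaussian components, $\sigma^2 > 0$. For every $j \in \mathbb{N}_M$, define the
matrix $\Psi_j = X_{G_j}^\top X_{G_j}/N$ and choose

$$ \lambda_j \ge \frac{2\sigma}{\sqrt{N}} \sqrt{ \mathrm{tr}(\Psi_j) + 2|||\Psi_j|||\bigl(2q\log M + \sqrt{K_j\, q\log M}\bigr) }. $$

Then with probability at least $1 - 2M^{1-q}$, for any solution $\hat\beta$ of problem (2.2) we have that

$$ \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 \le 4\|\beta^*\|_{2,1} \max_{1 \le j \le M} \lambda_j. \tag{3.9} $$

If in addition, $M(\beta^*) \le s$ and Assumption 3.1 holds with $\kappa = \kappa(s)$, then with probability at least
$1 - 2M^{1-q}$, for any solution $\hat\beta$ of problem (2.2) we have that

$$ \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 \le \frac{16}{\kappa^2}\sum_{j \in J(\beta^*)} \lambda_j^2, \tag{3.10} $$

$$ \|\hat\beta - \beta^*\|_{2,1} \le \frac{16}{\kappa^2}\sum_{j \in J(\beta^*)} \frac{\lambda_j^2}{\lambda_{\min}}, \tag{3.11} $$

$$ M(\hat\beta) \le \frac{64\phi_{\max}}{\kappa^2}\sum_{j \in J(\beta^*)} \frac{\lambda_j^2}{\lambda_{\min}^2}, \tag{3.12} $$

where $\lambda_{\min} = \min_{1 \le j \le M} \lambda_j$ and $\phi_{\max}$ is the maximum eigenvalue of the matrix $X^\top X/N$. If in
addition, Assumption RE($2s$) holds, then with the same probability for any solution $\hat\beta$ of problem
(2.2) we have that

$$ \|\hat\beta - \beta^*\| \le \frac{4\sqrt{10}}{\kappa^2(2s)} \frac{\sum_{j \in J(\beta^*)} \lambda_j^2}{\lambda_{\min}\sqrt{s}}. \tag{3.13} $$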
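Proof. Inequality (3.9) follows immediately from (3.2) with $\beta = \beta^*$. We now prove the remaining
assertions. Let $J = J(\beta^*) = \{j : (\beta^*)^j \ne 0\}$ and let $\Delta = \hat\beta - \beta^*$. By inequality (3.2) with $\beta = \beta^*$
we have, on the event $\mathcal{A}$, that

$$ \frac{1}{N}\|X\Delta\|^2 \le 4\sum_{j \in J} \lambda_j\|\Delta^j\| \le 4\sqrt{\sum_{j \in J} \lambda_j^2}\;\|\Delta_J\|. \tag{3.14} $$

Moreover by the same inequality, on the event $\mathcal{A}$, we have that $\sum_{j=1}^M \lambda_j\|\Delta^j\| \le 4\sum_{j \in J} \lambda_j\|\Delta^j\|$,
which implies that $\sum_{j \in J^c} \lambda_j\|\Delta^j\| \le 3\sum_{j \in J} \lambda_j\|\Delta^j\|$. Thus, by Assumption 3.1,

$$ \|\Delta_J\| \le \frac{\|X\Delta\|}{\kappa\sqrt{N}}. \tag{3.15} $$

Now, (3.10) follows from (3.14) and (3.15).

Inequality (3.11) follows by noting that, by (3.2),

$$ \sum_{j=1}^M \lambda_j\|\Delta^j\| \le 4\sum_{j \in J} \lambda_j\|\Delta^j\| \le 4\sqrt{\sum_{j \in J} \lambda_j^2}\;\|\Delta_J\| \le 4\sqrt{\sum_{j \in J} \lambda_j^2}\;\frac{\|X\Delta\|}{\kappa\sqrt{N}} $$

and then using (3.10) and $\sum_{j=1}^M \|\Delta^j\| \le \sum_{j=1}^M \|\Delta^j\|\lambda_j/\lambda_{\min}$.

Inequality (3.12) follows from (3.4) and (3.10).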

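Finally, we prove (3.13). Let $J'$ be the set of indices in $J^c$ corresponding to the $s$ largest values of
$\lambda_j\|\Delta^j\|$. Consider the set $J_{2s} = J \cup J'$. Note that $|J_{2s}| \le 2s$. Let $j(k)$ be the index of the $k$-th
largest element of the set $\{\lambda_j\|\Delta^j\| : j \in J^c\}$. Then,

$$ \lambda_{j(k)}\|\Delta^{j(k)}\| \le \frac{1}{k}\sum_{j \in J^c} \lambda_j\|\Delta^j\|. $$

This and the fact that $\sum_{j \in J^c} \lambda_j\|\Delta^j\| \le 3\sum_{j \in J} \lambda_j\|\Delta^j\|$ on the event $\mathcal{A}$ implies

$$ \begin{aligned} \sum_{j \in J_{2s}^c} \lambda_j^2\|\Delta^j\|^2 &\le \sum_{k=s+1}^{\infty} \frac{\bigl(\sum_{\ell \in J^c} \lambda_\ell\|\Delta^\ell\|\bigr)^2}{k^2} \le \frac{\bigl(\sum_{\ell \in J^c} \lambda_\ell\|\Delta^\ell\|\bigr)^2}{s} \\ &\le \frac{9\bigl(\sum_{\ell \in J} \lambda_\ell\|\Delta^\ell\|\bigr)^2}{s} \le \frac{9\bigl(\sum_{j \in J} \lambda_j^2\bigr)\|\Delta_J\|^2}{s} \le \frac{9\bigl(\sum_{j \in J} \lambda_j^2\bigr)\|\Delta_{J_{2s}}\|^2}{s}. \end{aligned} $$

Therefore, it follows that

$$ \|\Delta_{J_{2s}^c}\|^2 \le \frac{9}{s}\,\frac{\sum_{j \in J} \lambda_j^2}{\lambda_{\min}^2}\,\|\Delta_{J_{2s}}\|^2 $$

and, in turn, that

$$ \|\Delta\|^2 \le \frac{10}{s}\,\frac{\sum_{j \in J} \lambda_j^2}{\lambda_{\min}^2}\,\|\Delta_{J_{2s}}\|^2. \tag{3.16} $$

Next note from (3.14) that

$$ \frac{1}{N}\|X\Delta\|^2 \le 4\sqrt{\sum_{j \in J} \lambda_j^2}\;\|\Delta_{J_{2s}}\|. \tag{3.17} $$

In addition, $\sum_{j \in J^c} \lambda_j\|\Delta^j\| \le 3\sum_{j \in J} \lambda_j\|\Delta^j\|$ easily implies that

$$ \sum_{j \in J_{2s}^c} \lambda_j\|\Delta^j\| \le 3\sum_{j \in J_{2s}} \lambda_j\|\Delta^j\|. $$

Combining Assumption RE($2s$) with (3.17) we have, on the event $\mathcal{A}$, that

$$ \|\Delta_{J_{2s}}\| \le \frac{4\sqrt{\sum_{j \in J} \lambda_j^2}}{\kappa^2(2s)}. $$

This inequality and (3.16) yield (3.13). ■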

The oracle inequality (3.10) of Theorem 3.1 can be generalized to include the bias term as
follows.

Theorem 3.2. Let the assumptions of Lemma 3.1 be satisfied and let Assumption 3.1 hold with
$\kappa = \kappa(s)$ and with factor 3 replaced by 7. Then with probability at least $1 - 2M^{1-q}$, for any
solution $\hat\beta$ of problem (2.2) we have

$$ \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 \le \min\Bigl\{ \frac{96}{\kappa^2}\sum_{j \in J(\beta)} \lambda_j^2 + \frac{2}{N}\|X(\beta - \beta^*)\|^2 \;:\; \beta \in \mathbb{R}^K,\ M(\beta) \le s \Bigr\}. $$

This result is of interest when $\beta^*$ is only assumed to be approximately sparse, that is when there
exists a set of indices $J_0$ with cardinality smaller than $s$ such that $\|(\beta^*)_{J_0^c}\|^2$ is small.

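Proof. Let $\beta \in \mathbb{R}^K$ with $M(\beta) \le s$ be arbitrary. Set $\Delta = \hat\beta - \beta$. By inequality (3.2), we have, on the
event $\mathcal{A}$, that

$$ \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 + \sum_{j=1}^M \lambda_j\|\Delta^j\| \le \frac{1}{N}\|X(\beta - \beta^*)\|^2 + 4\sum_{j \in J(\beta)} \lambda_j\|\Delta^j\|. $$

Let $\gamma > 0$ be arbitrary. We consider two cases:

case i) $4\sum_{j \in J(\beta)} \lambda_j\|\Delta^j\| \ge \frac{1}{N}\|X(\beta - \beta^*)\|^2$,
case ii) $4\sum_{j \in J(\beta)} \lambda_j\|\Delta^j\| < \frac{1}{N}\|X(\beta - \beta^*)\|^2$.

In case i), we have

$$ \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 + \sum_{j=1}^M \lambda_j\|\Delta^j\| \le 8\sum_{j \in J(\beta)} \lambda_j\|\Delta^j\|. $$

This implies

$$ \sum_{j \in J(\beta)^c} \lambda_j\|\Delta^j\| \le 7\sum_{j \in J(\beta)} \lambda_j\|\Delta^j\|. $$

Thus, by Assumption 3.1 (with factor 3 replaced by 7), we have

$$ \|\Delta_{J(\beta)}\| \le \frac{\|X\Delta\|}{\kappa\sqrt{N}}. $$

We obtain

$$ \begin{aligned} \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 + \sum_{j=1}^M \lambda_j\|\Delta^j\| &\le 8\sqrt{\sum_{j \in J(\beta)} \lambda_j^2}\;\frac{\|X\Delta\|}{\kappa\sqrt{N}} \\ &\le \frac{8}{\kappa}\sqrt{\sum_{j \in J(\beta)} \lambda_j^2}\;\Bigl( \frac{\|X(\hat\beta - \beta^*)\|}{\sqrt{N}} + \frac{\|X(\beta - \beta^*)\|}{\sqrt{N}} \Bigr) \\ &\le \frac{1}{2}\,\frac{\|X(\hat\beta - \beta^*)\|^2}{N} + \frac{32}{\kappa^2}\sum_{j \in J(\beta)} \lambda_j^2 + \frac{\|X(\beta - \beta^*)\|^2}{N} + \frac{16}{\kappa^2}\sum_{j \in J(\beta)} \lambda_j^2. \end{aligned} $$

Hence

$$ \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 + 2\sum_{j=1}^M \lambda_j\|\Delta^j\| \le \frac{96}{\kappa^2}\sum_{j \in J(\beta)} \lambda_j^2 + \frac{2}{N}\|X(\beta - \beta^*)\|^2. $$

Case ii) gives

$$ \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 + \sum_{j=1}^M \lambda_j\|\Delta^j\| \le \frac{2}{N}\|X(\beta - \beta^*)\|^2. $$

Hence

$$ \frac{1}{N}\|X(\hat\beta - \beta^*)\|^2 \le \min\Bigl\{ \frac{96}{\kappa^2}\sum_{j \in J(\beta)} \lambda_j^2 + \frac{2}{N}\|X(\beta - \beta^*)\|^2 \;:\; \beta \in \mathbb{R}^K,\ M(\beta) \le s \Bigr\}. \qquad ■ $$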

We end this section by a remark about the Group Lasso estimator with overlapping groups,
i.e., when $\mathbb{N}_K = \bigcup_{j=1}^M G_j$ but $G_j \cap G_{j'} \ne \emptyset$ for some $j, j' \in \mathbb{N}_M$, $j \ne j'$. We refer to [17] for
motivation and discussion featuring the statistical relevance of group sparsity with overlapping
groups. Inspection of the proofs of Lemma 3.1 and Theorem 3.1 immediately yields the following
conclusion.

Remark 3.1. Inequalities (3.2) and (3.3) in Lemma 3.1 and inequalities (3.10)-(3.12) in Theorem
3.1 remain correct in the more general case of overlapping groups $G_1, \ldots, G_M$.

4 Sparsity oracle inequalities for multi-task learning 

We now apply the above results to the multi-task learning problem described in Section 2.1. In this setting, $K = MT$ and $N = nT$, where $T$ is the number of tasks, $n$ is the sample size for each task and $M$ is the nominal dimension of the unknown regression parameters for each task. Also, for every $j \in \mathbb{N}_M$, $K_j = T$ and $\Psi_j = (1/T) I_{T\times T}$, where $I_{T\times T}$ is the $T \times T$ identity matrix. This fact is a consequence of the block diagonal structure of the design matrix $X$ and the assumption that the variables are normalized to one, namely all the diagonal elements of the matrix $(1/n)X^\top X$ are equal to one. It follows that $\mathrm{tr}(\Psi_j) = 1$ and $|||\Psi_j||| = 1/T$. The regularization parameters $\lambda_j$ are all equal to the same value $\lambda$, cf. (2.6). Therefore, (3.1) takes the form

$$\lambda \ge \frac{2\sigma}{\sqrt{nT}}\sqrt{1 + \frac{2}{T}\Big(2q\log M + \sqrt{Tq\log M}\Big)}. \tag{4.1}$$

In particular, Lemma 3.1 and Theorem 3.1 are valid for

$$\lambda \ge \frac{2\sqrt{2}\,\sigma}{\sqrt{nT}}\sqrt{1 + \frac{5q\log M}{2T}},$$

since the right-hand side of this inequality is greater than that of (4.1).
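The comparison between the two thresholds can be checked in one line. The following is a sketch of that check, assuming (4.1) has the form $\lambda \ge \frac{2\sigma}{\sqrt{nT}}\big(1 + \frac{2}{T}(2q\log M + \sqrt{Tq\log M})\big)^{1/2}$ and the simplified threshold is $\frac{2\sqrt2\sigma}{\sqrt{nT}}\big(1 + \frac{5q\log M}{2T}\big)^{1/2}$; the shorthand $x = q\log M/T$ is introduced here only for illustration.

```latex
% Shorthand (not in the original text): x = q \log M / T \ge 0.
\[
1 + \frac{2}{T}\Bigl(2q\log M + \sqrt{Tq\log M}\Bigr)
  = 1 + 4x + 2\sqrt{x}
  \le 1 + 4x + (1 + x)
  = 2\Bigl(1 + \frac{5x}{2}\Bigr),
\]
% where the inequality is 2\sqrt{x} \le 1 + x, i.e. (1 - \sqrt{x})^2 \ge 0.
% Multiplying both sides by 4\sigma^2/(nT) and taking square roots gives
\[
\frac{2\sigma}{\sqrt{nT}}\sqrt{1 + \frac{2}{T}\Bigl(2q\log M + \sqrt{Tq\log M}\Bigr)}
  \le \frac{2\sqrt{2}\,\sigma}{\sqrt{nT}}\sqrt{1 + \frac{5q\log M}{2T}} .
\]
```

With $A = 5q/2$ this is the parametrization $1 + A\log M/T$ with $A > 5/2$, and the exponent $1 - q$ becomes $1 - 2A/5$.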

For the convenience of the reader we state the Restricted Eigenvalue assumption for the multi-task case [22].

Assumption 4.1. There exists a positive number $\kappa_{MT} = \kappa_{MT}(s)$ such that

$$\min\Big\{\frac{\|X\Delta\|}{\sqrt{n}\,\|\Delta_J\|} \,:\, |J| \le s,\ \Delta \in \mathbb{R}^{MT}\setminus\{0\},\ \|\Delta_{J^c}\|_{2,1} \le 3\|\Delta_J\|_{2,1}\Big\} \ge \kappa_{MT},$$

where $J^c$ denotes the complement of the set of indices $J$.

We note that the parameters $\kappa$, $\phi_{\max}$ defined in Section 3 correspond to $\kappa_{MT}/\sqrt{T}$ and $\phi_{MT}/T$ respectively, where $\phi_{MT}$ is the largest eigenvalue of the matrix $X^\top X/n$.

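Using the above observations we obtain the following corollary of Theorem 3.1.

Corollary 4.1. Consider the multi-task model (2.5) for $M \ge 2$ and $T, n \ge 1$. Assume that $W \in \mathbb{R}^N$ is a random vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ Gaussian components, $\sigma^2 > 0$, and all diagonal elements of the matrix $X^\top X/n$ are equal to 1. Set

$$\lambda = \frac{2\sqrt{2}\,\sigma}{\sqrt{nT}}\Big(1 + \frac{A\log M}{T}\Big)^{1/2},$$

where $A > 5/2$. Then with probability at least $1 - 2M^{1-2A/5}$, for any solution $\hat\beta$ of problem (2.6) we have that

$$\frac{1}{nT}\|X(\hat\beta-\beta^*)\|^2 \le \frac{8\sqrt{2}\,\sigma}{\sqrt{nT}}\Big(1 + \frac{A\log M}{T}\Big)^{1/2}\|\beta^*\|_{2,1}. \tag{4.2}$$

Moreover, if in addition it holds that $M(\beta^*) \le s$ and Assumption 4.1 holds with $\kappa_{MT} = \kappa_{MT}(s)$, then with probability at least $1 - 2M^{1-2A/5}$, for any solution $\hat\beta$ of problem (2.6) we have that

$$\frac{1}{nT}\|X(\hat\beta-\beta^*)\|^2 \le \frac{128\sigma^2}{\kappa_{MT}^2}\,\frac{s}{n}\Big(1 + \frac{A\log M}{T}\Big), \tag{4.3}$$

$$\frac{1}{\sqrt{T}}\|\hat\beta-\beta^*\|_{2,1} \le \frac{32\sqrt{2}\,\sigma}{\kappa_{MT}^2}\,\frac{s}{\sqrt{n}}\Big(1 + \frac{A\log M}{T}\Big)^{1/2}, \tag{4.4}$$

$$M(\hat\beta) \le \frac{64\phi_{MT}}{\kappa_{MT}^2}\,s, \tag{4.5}$$

where $\phi_{MT}$ is the largest eigenvalue of the matrix $X^\top X/n$.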

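Finally, if in addition $\kappa_{MT}(2s) > 0$, then with the same probability for any solution $\hat\beta$ of problem (2.6) we have that

$$\frac{1}{\sqrt{T}}\|\hat\beta-\beta^*\| \le \frac{16\sqrt{5}\,\sigma}{\kappa_{MT}^2(2s)}\sqrt{\frac{s}{n}}\Big(1 + \frac{A\log M}{T}\Big)^{1/2}. \tag{4.6}$$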

Note that the values $T$ and $\sqrt{T}$ in the denominators of the left-hand sides of inequalities (4.3), (4.4), and (4.6) appear quite naturally. For instance, the norm $\|\hat\beta - \beta^*\|_{2,1}$ in (4.4) is a sum of $M$ terms each of which is a Euclidean norm of a vector in $\mathbb{R}^T$, and thus it is of the order $\sqrt{T}$ if all the components are equal. Therefore, (4.4) can be interpreted as a correctly normalized "error per coefficient" bound.

Corollary 4.1 is valid for any fixed $n, M, T$; the approach is non-asymptotic. Some relations between these parameters are relevant in particular applications, and various asymptotics can be derived as special cases. For example, in multi-task learning it is natural to assume that $T \ge n$, and the motivation for our approach is the strongest if also $M \gg n$. The bounds of Corollary 4.1 are meaningful if the sparsity index $s$ is small as compared to the sample size $n$ and the logarithm of the dimension $\log M$ is not too large as compared to $T$.

More interestingly, the dependency on the dimension $M$ in the bounds is negligible if the number of tasks $T$ is larger than $\log M$. In this regime, no relation between the sample size $n$ and the dimension $M$ is required. This is quite in contrast to the standard results on sparse recovery where the condition

$$\log(\text{dimension}) \ll \text{sample size}$$

is considered as a sine qua non constraint. For example, Corollary 4.1 gives meaningful bounds if $M = \exp(n^{\gamma})$ for arbitrarily large $\gamma > 0$, provided that $T > n^{\gamma}$.
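To make the role of $T$ concrete, the quantities in Corollary 4.1 can be evaluated numerically. The sketch below assumes the regularization level $\lambda = \frac{2\sqrt2\sigma}{\sqrt{nT}}(1 + A\log M/T)^{1/2}$, the fast-rate bound $\frac{128\sigma^2}{\kappa_{MT}^2}\frac{s}{n}(1 + A\log M/T)$ and the failure probability $2M^{1-2A/5}$; the function names are hypothetical and the constants should be treated as an assumption about the statement rather than as authoritative.

```python
import math

def multitask_lambda(sigma, n, T, M, A):
    """Regularization level of Corollary 4.1 (assumed form); requires A > 5/2."""
    if A <= 2.5:
        raise ValueError("Corollary 4.1 assumes A > 5/2")
    return 2.0 * math.sqrt(2.0) * sigma / math.sqrt(n * T) * math.sqrt(1.0 + A * math.log(M) / T)

def prediction_bound(sigma, n, T, M, A, s, kappa_mt):
    """Assumed right-hand side of (4.3): 128 sigma^2 / kappa^2 * (s/n) * (1 + A log M / T)."""
    return 128.0 * sigma ** 2 / kappa_mt ** 2 * (s / n) * (1.0 + A * math.log(M) / T)

def failure_probability(M, A):
    """Upper bound 2 * M^(1 - 2A/5) on the probability that the bounds fail."""
    return 2.0 * M ** (1.0 - 2.0 * A / 5.0)

# Example regime: dimension exponential in n, number of tasks comparable to log M.
n, s, sigma, A, kappa_mt = 100, 5, 1.0, 3.0, 1.0
M = math.exp(20.0)
for T in (1, 20, 2000):
    print(T, prediction_bound(sigma, n, T, M, A, s, kappa_mt))
```

The loop illustrates the point made above: once $T$ exceeds $\log M$, the factor $1 + A\log M/T$ stays bounded and the bound is essentially $s/n$ up to constants.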

Finally, note that Corollary 4.1 is in the same spirit as a result that we obtained in [22] but there are two important differences. First, in [22] we considered larger values of $\lambda$, namely with $\big(1 + \frac{A\log M}{\sqrt{T}}\big)^{1/2}$ in place of $\big(1 + \frac{A\log M}{T}\big)^{1/2}$, and we obtained a result with higher probability. We switch here to the smaller $\lambda$ since it leads to minimax rate optimality, cf. lower bounds below. The second difference is that we include now the "slow rate" result (4.2), which guarantees convergence of the prediction loss with no restriction on the matrix $X^\top X$, provided that the $(2,1)$-norm of $\beta^*$ is bounded. For example, if the absolute values of all components of $\beta^*$ do not exceed some constant $\beta_{\max}$, then $\|\beta^*\|_{2,1} \le \beta_{\max}\, s\sqrt{T}$ and the bound (4.2) is of the order $\frac{s}{\sqrt{n}}\big(1 + \frac{A\log M}{T}\big)^{1/2}$.

5 Coordinate-wise estimation and selection of sparsity pattern 

In this section we show how, from any solution of (2.2), we can estimate the correct sparsity pattern $J(\beta^*)$ with high probability. We also establish bounds for estimation of $\beta^*$ in all $(2,p)$ norms with $1 \le p \le \infty$ under a stronger condition than Assumption 3.1.

Recall that we use the notation $\Psi = \frac{1}{N}X^\top X$ for the Gram matrix of the design. We introduce some additional notation which will be used throughout this section. For any $j, j'$ in $\mathbb{N}_M$ we define the matrix $\Psi[j, j'] = \frac{1}{N}X_{G_j}^\top X_{G_{j'}}$ (note that $\Psi[j, j] = \Psi_j$ for any $j$). We denote by $\Psi[j, j']_{t,t'}$, where $t \in \mathbb{N}_{K_j}$, $t' \in \mathbb{N}_{K_{j'}}$, the $(t,t')$-th element of the matrix $\Psi[j, j']$. For any $\Delta \in \mathbb{R}^K$ and $j \in \mathbb{N}_M$ we set $\Delta^j = (\Delta^j_t : t \in \mathbb{N}_{K_j})$.

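In this section, we assume that the following condition holds true.

Assumption 5.1. There exist some integer $s \ge 1$ and some constant $\alpha > 1$ such that:

1. For any $j \in \mathbb{N}_M$ and $t \in \mathbb{N}_{K_j}$ it holds that $(\Psi[j, j])_{t,t} = \phi$ and

$$\max_{1\le t,t'\le K_j,\ t\ne t'} |(\Psi[j, j])_{t,t'}| \le \frac{\lambda_{\min}\phi}{14\alpha\lambda_{\max}s}\,\frac{1}{K_j}.$$

2. For any $j \ne j' \in \mathbb{N}_M$ it holds that

$$\max_{1\le t\le \min(K_j, K_{j'})} |(\Psi[j, j'])_{t,t}| \le \frac{\lambda_{\min}\phi}{14\alpha\lambda_{\max}s}$$

and

$$\max_{1\le t\le K_j,\ 1\le t'\le K_{j'},\ t\ne t'} |(\Psi[j, j'])_{t,t'}| \le \frac{\lambda_{\min}\phi}{14\alpha\lambda_{\max}s}\,\frac{1}{\sqrt{K_jK_{j'}}}.$$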
This assumption is an extension to the general Group Lasso setting of the coherence condition of [22] introduced in the particular multi-task setting. Indeed, in the multi-task case $K_j = T$, $\lambda_{\min} = \lambda_{\max}$, and for any $j \in \mathbb{N}_M$ the matrix $X_{G_j}$ is block diagonal with the $t$-th block of size $n \times 1$ formed by the $j$-th column of the matrix $X_t$ (recall the notation in Section 2.1) and $\phi = 1/T$. It follows that $(\Psi[j, j'])_{t,t'} = 0$ for any $j, j' \in \mathbb{N}_M$ and $t \ne t' \in \mathbb{N}_T$. Then Assumption 5.1 reduces to the following: $\max_{1\le t\le T} |(\Psi[j, j'])_{t,t}| \le \frac{1}{14\alpha sT}$ whenever $j \ne j'$ and $(\Psi[j, j])_{t,t} = \frac{1}{T}$. Thus, we see that for the multi-task model Assumption 5.1 takes the form of the usual coherence assumption for each of the $T$ separate regression problems. We also note that the coherence assumption in [22] was formulated with the numerical constant 7 instead of 14. The larger constant here is due to the fact that we consider the general model with a not necessarily block diagonal design matrix, in contrast to the multi-task setting of [22].

Lemma A.2, which is presented in the appendix, establishes that Assumption 5.1 implies Assumption 3.1. Note also that, by an argument as in [21], it is not hard to show that under Assumption 5.1 any group $s$-sparse vector $\beta^*$ satisfying (2.1) is unique.

Theorem 3.1 provides bounds for compound measures of risk, that is, depending simultaneously on all the vectors $\beta^j$. An important question is to evaluate the performance of estimators for each of the components $\beta^j$ separately. The next theorem provides a bound of this type and, as a consequence, a result on the selection of the sparsity pattern.

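Theorem 5.1. Let the assumptions of Theorem 3.1 be satisfied and let Assumption 5.1 hold with the same $s$. Set

$$c = \frac{3}{2} + \frac{16}{7(\alpha - 1)}. \tag{5.1}$$

Then with probability at least $1 - 2M^{1-q}$, for any solution $\hat\beta$ of problem (2.2) we have that

$$\|\hat\beta - \beta^*\|_{2,\infty} \le \frac{c}{\phi}\lambda_{\max}. \tag{5.2}$$

If in addition,

$$\min_{j\in J(\beta^*)} \|(\beta^*)^j\| > \frac{2c}{\phi}\lambda_{\max}, \tag{5.3}$$

then with the same probability for any solution $\hat\beta$ of problem (2.2) the set of indices

$$\hat J = \Big\{ j : \|\hat\beta^j\| > \frac{c}{\phi}\lambda_{\max} \Big\} \tag{5.4}$$

estimates correctly the sparsity pattern $J(\beta^*)$, that is,

$$\hat J = J(\beta^*).$$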

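Proof. Set $K_\infty = \max_{1\le j\le M} K_j$. We define first for any $j, j' \in \mathbb{N}_M$ the $K_\infty \times K_\infty$ matrix $\tilde\Psi[j, j']$ as follows. If $j \ne j'$ we have $(\tilde\Psi[j, j'])_{t\in\mathbb{N}_{K_j},\, t'\in\mathbb{N}_{K_{j'}}} = \Psi[j, j']$ and $(\tilde\Psi[j, j'])_{t,t'} = 0$ if $t > K_j$ or if $t' > K_{j'}$. If $j = j'$ we have $(\tilde\Psi[j, j])_{t,t'\in\mathbb{N}_{K_j}} = \Psi[j, j] - \phi I_{K_j\times K_j}$ and $(\tilde\Psi[j, j])_{t,t'} = 0$ if $t > K_j$ or if $t' > K_j$. Similarly, for any $\Delta \in \mathbb{R}^K$ and any $j \in \mathbb{N}_M$ we set $\tilde\Delta^j \in \mathbb{R}^{K_\infty}$ such that $(\tilde\Delta^j)_{t\in\mathbb{N}_{K_j}} = \Delta^j$ and $\tilde\Delta^j_t = 0$ for any $t > K_j$.

Set $\Delta = \hat\beta - \beta^*$. We have

$$\phi\|\Delta\|_{2,\infty} \le \|\Psi\Delta\|_{2,\infty} + \|(\Psi - \phi I_{K\times K})\Delta\|_{2,\infty}. \tag{5.5}$$

Using Cauchy-Schwarz's inequality we obtain

$$\|(\Psi - \phi I_{K\times K})\Delta\|_{2,\infty} = \max_{1\le j\le M}\Big[\sum_{t=1}^{K_j}\Big(\sum_{j'=1}^{M}\sum_{t'=1}^{K_{j'}} (\tilde\Psi[j, j'])_{t,t'}\,\Delta^{j'}_{t'}\Big)^2\Big]^{1/2}$$
$$\le \max_{1\le j\le M}\Big[\sum_{t=1}^{K_j}\Big(\sum_{j'=1}^{M} (\tilde\Psi[j, j'])_{t,t}\,\tilde\Delta^{j'}_{t}\Big)^2\Big]^{1/2} + \max_{1\le j\le M}\Big[\sum_{t=1}^{K_j}\Big(\sum_{j'=1}^{M}\sum_{t'=1,\,t'\ne t}^{K_{j'}} (\tilde\Psi[j, j'])_{t,t'}\,\Delta^{j'}_{t'}\Big)^2\Big]^{1/2}. \tag{5.6}$$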

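We now treat the first term on the right-hand side of (5.6). We have, using Assumption 5.1 and Minkowski's inequality for the Euclidean norm in $\mathbb{R}^{K_\infty}$, that

$$\max_{1\le j\le M}\Big[\sum_{t=1}^{K_j}\Big(\sum_{j'=1}^{M} (\tilde\Psi[j, j'])_{t,t}\,\tilde\Delta^{j'}_{t}\Big)^2\Big]^{1/2} \le \frac{\lambda_{\min}\phi}{14\alpha\lambda_{\max}s}\max_{1\le j\le M}\Big[\sum_{t=1}^{K_j}\Big(\sum_{j'=1}^{M} |\tilde\Delta^{j'}_{t}|\Big)^2\Big]^{1/2}$$
$$\le \frac{\lambda_{\min}\phi}{14\alpha\lambda_{\max}s}\|\tilde\Delta\|_{2,1} \le \frac{\lambda_{\min}\phi}{14\alpha\lambda_{\max}s}\|\Delta\|_{2,1},$$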

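since $\|\tilde\Delta\|_{2,1} \le \|\Delta\|_{2,1}$ by definition of $\tilde\Delta$. Next we treat the second term on the right-hand side of (5.6). Cauchy-Schwarz's inequality gives

$$\max_{1\le j\le M}\Big[\sum_{t=1}^{K_j}\Big(\sum_{j'=1}^{M}\sum_{t'=1,\,t'\ne t}^{K_{j'}} (\tilde\Psi[j, j'])_{t,t'}\,\Delta^{j'}_{t'}\Big)^2\Big]^{1/2} \le \frac{\lambda_{\min}\phi}{14\alpha\lambda_{\max}s}\max_{1\le j\le M}\Big[\sum_{t=1}^{K_j}\Big(\sum_{j'=1}^{M}\sum_{t'=1}^{K_{j'}} \frac{|\Delta^{j'}_{t'}|}{\sqrt{K_jK_{j'}}}\Big)^2\Big]^{1/2}$$
$$\le \frac{\lambda_{\min}\phi}{14\alpha\lambda_{\max}s}\max_{1\le j\le M}\Big[\sum_{t=1}^{K_j}\frac{1}{K_j}\Big(\sum_{j'=1}^{M}\|\Delta^{j'}\|\Big)^2\Big]^{1/2} \le \frac{\lambda_{\min}\phi}{14\alpha\lambda_{\max}s}\|\Delta\|_{2,1}.$$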

Combining the four above displays we get

$$\|\Delta\|_{2,\infty} \le \frac{1}{\phi}\|\Psi\Delta\|_{2,\infty} + \frac{2\lambda_{\min}}{14\alpha\lambda_{\max}s}\|\Delta\|_{2,1}.$$

Thus, by inequalities (3.3) and (3.11), with probability at least $1 - 2M^{1-q}$, it holds that

$$\|\Delta\|_{2,\infty} \le \Big(\frac{3}{2\phi} + \frac{16}{7\alpha\kappa^2}\Big)\lambda_{\max}.$$

By Lemma A.2, $\alpha\kappa^2 = (\alpha - 1)\phi$, which yields the first result of the theorem. The second result follows from the first one in an obvious way. ■



An assumption of type (5.3) is inevitable in the context of selection of the sparsity pattern. It says that the vectors $(\beta^*)^j$ cannot be arbitrarily close to 0 for $j$ in the pattern. Their norms should be at least somewhat larger than the noise level.
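The selection rule of Theorem 5.1 is a plain thresholding of the block norms of a Group Lasso solution. A minimal sketch follows, assuming the estimate is stored as a flat NumPy array and each group $G_j$ as a list of 0-based indices; the helper names `group_norms` and `threshold_support` are hypothetical, and the threshold (of the form $c\lambda_{\max}/\phi$ in the theorem) is passed in by the caller rather than computed.

```python
import numpy as np

def group_norms(beta, groups):
    """Euclidean norm of each block beta^j, where groups[j] lists the indices in G_j."""
    return np.array([np.linalg.norm(beta[np.asarray(g)]) for g in groups])

def threshold_support(beta_hat, groups, threshold):
    """Set of group indices j whose block norm strictly exceeds the threshold."""
    norms = group_norms(beta_hat, groups)
    return {j for j, v in enumerate(norms) if v > threshold}

# Toy example: three groups of size two; the block norms are 5.0, 0.1 and 2.0.
beta_hat = np.array([3.0, 4.0, 0.1, 0.0, 0.0, 2.0])
groups = [[0, 1], [2, 3], [4, 5]]
print(threshold_support(beta_hat, groups, threshold=1.0))  # groups 0 and 2 exceed the threshold
```

The rule only recovers $J(\beta^*)$ when the nonzero blocks are separated from zero as in (5.3); small but nonzero blocks, like the second group above, are discarded.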

Theorems 3.1 and 5.1 imply the following corollary.

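Corollary 5.1. Let the assumptions of Theorem 3.1 be satisfied and let Assumption 5.1 hold with the same $s$. Then with probability at least $1 - 2M^{1-q}$, for any solution $\hat\beta$ of problem (2.2) and any $1 \le p < \infty$ we have that

$$\|\hat\beta - \beta^*\|_{2,p} \le \frac{c_1}{\phi}\lambda_{\max}\Big(\sum_{j\in J(\beta^*)}\frac{\lambda_j^2}{\lambda_{\min}\lambda_{\max}}\Big)^{1/p}, \tag{5.7}$$

where

$$c_1 = \Big(\frac{16\alpha}{\alpha - 1}\Big)^{1/p}\Big(\frac{3}{2} + \frac{16}{7(\alpha - 1)}\Big)^{1 - 1/p}. \tag{5.8}$$

If in addition, (5.3) holds, then with the same probability for any solution $\hat\beta$ of problem (2.2) and any $1 \le p < \infty$ we have that

$$\|\hat\beta - \beta^*\|_{2,p} \le \frac{c_1}{\phi}\lambda_{\max}\Big(\sum_{j\in \hat J}\frac{\lambda_j^2}{\lambda_{\min}\lambda_{\max}}\Big)^{1/p}, \tag{5.9}$$

where $\hat J$ is defined in (5.4).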

Proof. Set $\Delta = \hat\beta - \beta^*$. For any $p \ge 1$ we use the norm interpolation inequality

$$\|\Delta\|_{2,p} \le \|\Delta\|_{2,1}^{1/p}\,\|\Delta\|_{2,\infty}^{1 - 1/p}.$$

Combining inequalities (3.11) and (5.2) with $\kappa = \sqrt{(1 - 1/\alpha)\phi}$ (cf. Lemma A.2) and the last inequality yields (5.7). Inequality (5.9) is then straightforward in view of Theorem 5.1. ■
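The interpolation step used in this proof is elementary and easy to check numerically. The sketch below computes mixed $(2,p)$-norms for a vector split into groups and compares $\|\Delta\|_{2,p}$ with $\|\Delta\|_{2,1}^{1/p}\|\Delta\|_{2,\infty}^{1-1/p}$; the function names and the group encoding (lists of 0-based indices) are illustrative assumptions.

```python
import numpy as np

def mixed_norm(delta, groups, p):
    """Mixed (2,p)-norm: the l_p norm of the vector of Euclidean block norms."""
    norms = np.array([np.linalg.norm(delta[np.asarray(g)]) for g in groups])
    if np.isinf(p):
        return norms.max()
    return (norms ** p).sum() ** (1.0 / p)

def interpolation_bound(delta, groups, p):
    """Right-hand side ||delta||_{2,1}^{1/p} * ||delta||_{2,inf}^{1 - 1/p}."""
    n1 = mixed_norm(delta, groups, 1)
    ninf = mixed_norm(delta, groups, np.inf)
    return n1 ** (1.0 / p) * ninf ** (1.0 - 1.0 / p)

rng = np.random.default_rng(0)
groups = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
delta = rng.normal(size=9)
for p in (1, 1.5, 2, 5):
    # The left value never exceeds the right one (up to rounding).
    print(p, mixed_norm(delta, groups, p), interpolation_bound(delta, groups, p))
```

The inequality follows from $\sum_j a_j^p \le (\max_j a_j)^{p-1}\sum_j a_j$ applied to the block norms $a_j = \|\Delta^j\|$, with equality at $p = 1$.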

Note that we introduce inequalities (5.2) and (5.9), valid with probability close to 1, because their right-hand sides are data driven, and so they can be used as confidence bands for the unknown parameter $\beta^*$ in mixed $(2,p)$-norms.

We finally derive a corollary of Theorem 5.1 for the multi-task setting, which is straightforward in view of the above results.

Corollary 5.2. Consider the multi-task model (2.5) for $M \ge 2$ and $T, n \ge 1$. Let the assumptions of Theorem 5.1 be satisfied and set

$$\lambda = \frac{2\sqrt{2}\,\sigma}{\sqrt{nT}}\Big(1 + \frac{A\log M}{T}\Big)^{1/2},$$

where $A > 5/2$. Then with probability at least $1 - 2M^{1-2A/5}$, for any solution $\hat\beta$ of problem (2.6) and any $1 \le p \le \infty$ we have

$$\frac{1}{\sqrt{T}}\|\hat\beta - \beta^*\|_{2,p} \le \frac{2\sqrt{2}\,c_1\sigma\, s^{1/p}}{\sqrt{n}}\Big(1 + \frac{A\log M}{T}\Big)^{1/2}, \tag{5.10}$$

where $c_1$ is the constant defined in (5.8) and we set $x^{1/\infty} = 1$ for any $x > 0$. If in addition,

$$\min_{j\in J(\beta^*)} \frac{1}{\sqrt{T}}\|(\beta^*)^j\| > \frac{4\sqrt{2}\,c\,\sigma}{\sqrt{n}}\Big(1 + \frac{A\log M}{T}\Big)^{1/2}, \tag{5.11}$$

then with the same probability for any solution $\hat\beta$ of problem (2.6) the set of indices

$$\hat J = \Big\{ j : \frac{1}{\sqrt{T}}\|\hat\beta^j\| > \frac{2\sqrt{2}\,c\,\sigma}{\sqrt{n}}\Big(1 + \frac{A\log M}{T}\Big)^{1/2} \Big\}$$

estimates correctly the sparsity pattern $J(\beta^*)$, that is,

$$\hat J = J(\beta^*).$$

6 Minimax lower bounds for arbitrary estimators 

In this section we consider again the multi-task model as in Sections 2.1 and 4. We will show that the rate of convergence obtained in Corollary 4.1 is optimal in a minimax sense (up to a logarithmic factor) for all estimators over a class of group sparse vectors. This will be done under the following mild condition on the matrix $X$.

Assumption 6.1. There exist positive constants $\kappa_1$ and $\kappa_2$ such that for any vector $\Delta \in \mathbb{R}^{MT}\setminus\{0\}$ with $M(\Delta) \le 2s$ we have

$$\text{(a)}\ \ \frac{\|X\Delta\|^2}{n\|\Delta\|^2} \ge \kappa_1^2, \qquad \text{(b)}\ \ \frac{\|X\Delta\|^2}{n\|\Delta\|^2} \le \kappa_2^2.$$

Note that part (b) of Assumption 6.1 is automatically satisfied with $\kappa_2^2 = \phi_{MT}$ where $\phi_{MT}$ is the spectral norm of the matrix $X^\top X/n$. The reason for introducing this assumption is that the $2s$-restricted maximal eigenvalue $\kappa_2^2$ can be much smaller than the spectral norm of $X^\top X/n$, which would result in a sharper lower bound, see Theorem 6.1 below.

In what follows we fix $T \ge 1$, $M \ge 2$, $s \le M/2$ and denote by $GS(s, M, T)$ the set of vectors $\beta \in \mathbb{R}^{MT}$ such that $M(\beta) \le s$. Let $\ell : \mathbb{R}_+ \to \mathbb{R}_+$ be a nondecreasing function such that $\ell(0) = 0$ and $\ell \not\equiv 0$.

Theorem 6.1. Consider the multi-task model (2.5) for $M \ge 2$ and $T, n \ge 1$. Assume that $W \in \mathbb{R}^N$ is a random vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ Gaussian components, $\sigma^2 > 0$. Suppose that $s \le M/2$ and let part (b) of Assumption 6.1 be satisfied. Define

$$\psi_{n,p} = \frac{\sigma}{\kappa_2}\,\frac{s^{1/p}}{\sqrt{n}}\Big(1 + \frac{\log(eM/s)}{T}\Big)^{1/2}, \qquad 1 \le p \le \infty,$$

where we set $s^{1/\infty} = 1$. Then there exist positive constants $b, c$ depending only on $\ell(\cdot)$ and $p$ such that

$$\inf_{\tau}\ \sup_{\beta^*\in GS(s,M,T)} \mathbb{E}\,\ell\Big(b\,\psi_{n,p}^{-1}\,\frac{1}{\sqrt{T}}\|\tau - \beta^*\|_{2,p}\Big) \ge c, \tag{6.1}$$

where $\inf_\tau$ denotes the infimum over all estimators $\tau$ of $\beta^*$. If, in addition, part (a) of Assumption 6.1 is satisfied, then there exist positive constants $b, c$ depending only on $\ell(\cdot)$ such that

$$\inf_{\tau}\ \sup_{\beta^*\in GS(s,M,T)} \mathbb{E}\,\ell\Big(b\,\psi_{n,2}^{-1}\,\frac{1}{\kappa_1\sqrt{nT}}\|X(\tau - \beta^*)\|\Big) \ge c. \tag{6.2}$$

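Proof. Fix $p$ and write for brevity $\psi_n = \psi_{n,p}$ where it causes no ambiguity. Throughout this proof we set $x^{1/\infty} = 1$ for any $x \ge 0$. We consider first the case $T \le \log(eM/s)$. Set $\mathbf{0} = (0, \ldots, 0) \in \mathbb{R}^T$, $\mathbf{1} = (1, \ldots, 1) \in \mathbb{R}^T$. Define the set of vectors

$$\Omega = \big\{\omega \in \mathbb{R}^{MT} : \omega^j \in \{\mathbf{0}, \mathbf{1}\},\ j = 1, \ldots, M,\ \text{and}\ M(\omega) \le s\big\},$$

and its dilation

$$\mathcal{C}(\Omega) = \big\{\gamma\psi_{n,p}\,\omega/s^{1/p} : \omega \in \Omega\big\},$$

where $\gamma > 0$ is an absolute constant to be chosen later. Note that $\mathcal{C}(\Omega) \subset GS(s, M, T)$.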

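For any $\omega, \omega'$ in $\Omega$ we have $M(\omega - \omega') \le 2s$. Thus, for $\beta = \gamma\psi_{n,p}\,\omega/s^{1/p}$, $\beta' = \gamma\psi_{n,p}\,\omega'/s^{1/p}$, parts (a) and (b) of Assumption 6.1 imply respectively

$$\frac{1}{n}\|X\beta - X\beta'\|^2 \ge \frac{\kappa_1^2\gamma^2\psi_{n,p}^2\,\rho(\omega, \omega')\,T}{s^{2/p}}, \tag{6.3}$$

$$\frac{1}{n}\|X\beta - X\beta'\|^2 \le \frac{\kappa_2^2\gamma^2\psi_{n,p}^2\,\rho(\omega, \omega')\,T}{s^{2/p}}, \tag{6.4}$$

where $\rho(\omega, \omega') = \sum_{j=1}^{M} I\{\omega^j \ne (\omega')^j\}$ and $I\{\cdot\}$ denotes the indicator function. This and the definition of $\psi_{n,p}$ yield that if part (a) of Assumption 6.1 holds, then for all $\omega, \omega' \in \Omega$ we have

$$\frac{1}{nT}\|X\beta - X\beta'\|^2 \ge \frac{\kappa_1^2\gamma^2\sigma^2}{\kappa_2^2\,n}\Big(1 + \frac{\log(eM/s)}{T}\Big)\rho(\omega, \omega'). \tag{6.5}$$

Also, by definition of $\beta, \beta'$,

$$\frac{1}{\sqrt{T}}\|\beta - \beta'\|_{2,p} = \frac{\gamma\sigma}{\kappa_2\sqrt{n}}\Big(1 + \frac{\log(eM/s)}{T}\Big)^{1/2}\big(\rho(\omega, \omega')\big)^{1/p}. \tag{6.6}$$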

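For $\theta \in \mathbb{R}^{nT}$, we denote by $P_\theta$ the probability distribution of a $\mathcal{N}(\theta, \sigma^2 I_{nT\times nT})$ Gaussian random vector. We denote by $\mathcal{K}(P, Q)$ the Kullback-Leibler divergence between the probability measures $P$ and $Q$. Then, under part (b) of Assumption 6.1,

$$\mathcal{K}(P_{X\beta}, P_{X\beta'}) = \frac{1}{2\sigma^2}\|X\beta - X\beta'\|^2 \le \gamma^2 s\,[T + \log(eM/s)] \le 2\gamma^2 s\log(eM/s), \tag{6.7}$$

where we used that $\rho(\omega, \omega') \le 2s$ for all $\omega, \omega' \in \Omega$. Lemma 8.3 in [32] guarantees the existence of a subset $\mathcal{N}$ of $\Omega$ such that

$$\log(|\mathcal{N}|) \ge \tilde c\, s\log\Big(\frac{eM}{s}\Big), \tag{6.8}$$
$$\rho(\omega, \omega') \ge s/4, \quad \forall\, \omega, \omega' \in \mathcal{N},\ \omega \ne \omega',$$

for some absolute constant $\tilde c > 0$, where $|\mathcal{N}|$ denotes the cardinality of $\mathcal{N}$. Combining this with (6.5) and (6.6) we find that the finite set of vectors $\mathcal{C}(\mathcal{N})$ is such that, for all $\beta, \beta' \in \mathcal{C}(\mathcal{N})$, $\beta \ne \beta'$,

$$\frac{1}{\sqrt{T}}\|\beta - \beta'\|_{2,p} \ge \frac{\gamma\sigma s^{1/p}}{4^{1/p}\kappa_2\sqrt{n}}\Big(1 + \frac{\log(eM/s)}{T}\Big)^{1/2} = \frac{\gamma}{4^{1/p}}\,\psi_{n,p},$$

and under part (a) of Assumption 6.1

$$\frac{1}{nT}\|X\beta - X\beta'\|^2 \ge \frac{\gamma^2\kappa_1^2\sigma^2 s}{4\kappa_2^2\,n}\Big(1 + \frac{\log(eM/s)}{T}\Big) = \frac{\gamma^2}{4}\,\kappa_1^2\,\psi_{n,2}^2.$$

Furthermore, by (6.7) and (6.8), for all $\beta, \beta' \in \mathcal{C}(\mathcal{N})$ under part (b) of Assumption 6.1 we have

$$\mathcal{K}(P_{X\beta}, P_{X\beta'}) \le \frac{1}{16}\log(|\mathcal{N}|) = \frac{1}{16}\log(|\mathcal{C}(\mathcal{N})|)$$

for an absolute constant $\gamma > 0$ chosen small enough. Thus, the result follows by application of Theorem 2.7 in [35].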

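Consider now the case $T > \log(eM/s)$. Introduce the set of vectors

$$\Omega' = \big\{\omega \in \mathbb{R}^{MT} : \omega = (\omega^1, \ldots, \omega^M),\ \omega^j \in \{0, 1\}^T\ \text{if}\ j \le s\ \text{and}\ \omega^j = \mathbf{0}\ \text{otherwise}\big\},$$

and the associated dilated set $\mathcal{C}(\Omega')$ defined as above. Note that $\mathcal{C}(\Omega') \subset GS(s, M, T)$.

For any $\omega, \omega' \in \Omega'$ we define $\rho'(\omega, \omega') = \sum_{j=1}^{M}\sum_{t=1}^{T} I\{\omega_{tj} \ne \omega'_{tj}\} = \sum_{j=1}^{s}\sum_{t=1}^{T} I\{\omega_{tj} \ne \omega'_{tj}\}$.

We assume first that $Ts \ge 8$. Then the Varshamov-Gilbert Lemma (see Lemma 2.9 in [35]) guarantees that there exists a subset $\mathcal{N}'$ of $\Omega'$ such that

$$|\mathcal{N}'| \ge 2^{Ts/8}, \tag{6.9}$$
$$\rho'(\omega, \omega') \ge \frac{Ts}{8}, \quad \forall\, \omega, \omega' \in \mathcal{N}',\ \omega \ne \omega'.$$

Next, for any $\omega, \omega' \in \mathcal{N}'$ we have $M(\omega - \omega') \le 2s$, and thus under parts (a) and (b) of Assumption 6.1 we have, respectively,

$$\frac{1}{n}\|X\beta - X\beta'\|^2 \ge \frac{\kappa_1^2\gamma^2\psi_n^2\,\rho'(\omega, \omega')}{s^{2/p}}, \qquad \frac{1}{n}\|X\beta - X\beta'\|^2 \le \frac{\kappa_2^2\gamma^2\psi_n^2\,\rho'(\omega, \omega')}{s^{2/p}},$$

where $\beta = \gamma\psi_n\,\omega/s^{1/p}$, $\beta' = \gamma\psi_n\,\omega'/s^{1/p}$ are any two elements of $\mathcal{C}(\mathcal{N}')$.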

Now, using Lemma A.3 in the Appendix we get that, for all $\omega, \omega' \in \mathcal{N}'$ such that $\omega \ne \omega'$,

/ s \^/p Vt 
\\co-uj'\\2,p>[—) — , Vl<p<oo. (6.10) 

Thus, for all f3, f3' E C{N') such that /3 ^ /3' we have 

^ II fl fl'ii ^^" II 'II ^ ^ / 



(recall that $\psi_n = \psi_{n,p}$), and under part (a) of Assumption 6.1



Furthermore, for all $\beta, \beta' \in \mathcal{C}(\mathcal{N}')$ under part (b) of Assumption 6.1

]C{Pxp,Pxp') < 2f.T<-^log(|C(Ar')|), 

where, in view of (6.9), the last inequality holds for an absolute constant $\gamma > 0$ chosen small enough. We apply again Theorem 2.7 in [35] to get the result.

Finally, if $T > \log(eM/s)$ and $Ts < 8$, then the rate $\psi_n$ is of the order $1/\sqrt{n}$. This is the standard parametric rate and the lower bounds are easily obtained by reduction to distinguishing between two elements of $GS(s, M, T)$. ■



As a consequence of Theorem 6.1 we get, for example, the lower bounds for the squared loss $\ell(u) = u^2$ and for the indicator loss $\ell(u) = I\{u \ge 1\}$. The indicator loss is relevant for comparison with the upper bounds of Corollaries 4.1 and 5.2. For example, Theorem 6.1 with this loss and $p = 1, 2$ implies that there exists $\beta^* \in GS(s, M, T)$ such that, for any estimator $\tau$ of $\beta^*$,

$$\frac{1}{nT}\|X(\tau - \beta^*)\|^2 \ge C\,\frac{s}{n}\Big(1 + \frac{\log(eM/s)}{T}\Big)$$

and

$$\frac{1}{\sqrt{T}}\|\tau - \beta^*\|_{2,1} \ge C\,\frac{s}{\sqrt{n}}\Big(1 + \frac{\log(eM/s)}{T}\Big)^{1/2}, \qquad \frac{1}{\sqrt{T}}\|\tau - \beta^*\| \ge C\sqrt{\frac{s}{n}}\Big(1 + \frac{\log(eM/s)}{T}\Big)^{1/2}$$

with a positive probability (independent of $n, s, M, T$) where $C > 0$ is some constant. The rate on the right-hand side of these inequalities is of the same order as in the corresponding upper bounds in Corollary 4.1, except that $\log M$ is replaced here by $\log(eM/s)$. We conjecture that the factor $\log(eM/s)$ and not $\log M$ corresponds to the optimal rate; actually, we know that this conjecture is true when $T = 1$ and the risk is defined by the prediction error with $\ell(u) = u^2$ [32].

A weaker version of Theorem 6.1, with $\ell(u) = u^2$, $p = 2$ and suboptimal rate of the order $[s\log(M/s)/(nT)]^{1/2}$, is established in yjj.
Remark 6.1. For the model with usual (non-grouped) sparsity, which corresponds to $T = 1$, the set $GS(s, M, 1)$ coincides with the $\ell_0$-ball of radius $s$ in $\mathbb{R}^M$. Therefore, Theorem 6.1 generalizes the minimax lower bounds on $\ell_0$-balls recently obtained in [30] and [32] for the usual sparsity model. Those papers considered only the prediction error and the $\ell_2$ error under the squared loss $\ell(u) = u^2$. Theorem 6.1 covers any $\ell_p$ error with $1 \le p \le \infty$ and applies with general loss functions $\ell(\cdot)$. As a particular instance, for the indicator loss $\ell(u) = I\{u \ge 1\}$ and $T = 1$, the lower bounds of Theorem 6.1 show that the upper bounds for the prediction error and the $\ell_p$ errors ($1 \le p \le \infty$) of the usual Lasso estimator established in [4] and [21] cannot be improved in a minimax sense on $\ell_0$-balls up to logarithmic factors. Note that this conclusion cannot be deduced from the lower bounds of [30] and [32].
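The minimax rate of Theorem 6.1 is simple to tabulate. The sketch below assumes the rate has the form $\psi_{n,p} = \frac{\sigma}{\kappa_2}\frac{s^{1/p}}{\sqrt{n}}\big(1 + \frac{\log(eM/s)}{T}\big)^{1/2}$ with the convention $s^{1/\infty} = 1$; the function name `minimax_rate` is hypothetical.

```python
import math

def minimax_rate(sigma, kappa2, n, T, M, s, p):
    """Assumed form of psi_{n,p}; p may be math.inf, in which case s^(1/p) is read as 1."""
    s_pow = 1.0 if math.isinf(p) else s ** (1.0 / p)
    log_term = math.log(math.e * M / s)
    return sigma / kappa2 * s_pow / math.sqrt(n) * math.sqrt(1.0 + log_term / T)

# The T = 1 case is the usual (non-grouped) sparsity model; larger T weakens the log factor.
for T in (1, 10, 1000):
    print(T, minimax_rate(sigma=1.0, kappa2=1.0, n=200, T=T, M=10_000, s=10, p=2))
```

For $T \ge \log(eM/s)$ the bracket is at most 2, so the rate is within a constant factor of $\sigma s^{1/p}/(\kappa_2\sqrt{n})$, independently of the dimension $M$.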

7 Lower bounds for the Lasso 

In this section we establish lower bounds on the prediction and estimation accuracy of the Lasso estimator. As a consequence, we can emphasize the advantages of using the Group Lasso estimator as compared to the usual Lasso in some important particular cases.
The Lasso estimator is a solution of the minimization problem

$$\min_{\beta\in\mathbb{R}^K}\ \frac{1}{N}\|X\beta - y\|^2 + 2r\|\beta\|_1, \tag{7.1}$$

where $\|\beta\|_1 = \sum_{j=1}^{K}|\beta_j|$ and $r$ is a positive parameter. The following notations apply only to this section. For any vector $\beta \in \mathbb{R}^K$ and any subset $J \subseteq \mathbb{N}_K$, we denote by $\beta_{|J}$ the vector in $\mathbb{R}^K$ which has the same coordinates as $\beta$ on $J$ and zero coordinates on the complement $J^c$ of $J$, $J'(\beta) = \{j : \beta_j \ne 0\}$ and $M'(\beta) = |J'(\beta)|$.

We will use the following standard assumption on the matrix $X$ (the Restricted Eigenvalue condition in [4]).

Assumption 7.1. Fix $s' \ge 1$. There exists a positive number $\kappa'$ such that

$$\min\Big\{\frac{\|X\Delta\|}{\sqrt{N}\,\|\Delta_{|J}\|} \,:\, |J| \le s',\ \Delta \in \mathbb{R}^K\setminus\{0\},\ \sum_{j\in J^c}|\Delta_j| \le 3\sum_{j\in J}|\Delta_j|\Big\} \ge \kappa',$$

where $J^c$ denotes the complement of the set of indices $J$.
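Theorem 7.1. Let Assumption 7.1 be satisfied. Assume that $W \in \mathbb{R}^N$ is a random vector with i.i.d. $\mathcal{N}(0, \sigma^2)$ Gaussian components, $\sigma^2 > 0$. Set $r = A\sigma\sqrt{\frac{\phi\log K}{N}}$ where $A > 2\sqrt{2}$ and $\phi$ is the maximal diagonal element of the matrix $\Psi = \frac{1}{N}X^\top X$. If $\hat\beta^L$ is a solution of problem (7.1), then with probability at least $1 - K^{1 - A^2/8}$ we have

$$\frac{1}{N}\|X(\hat\beta^L - \beta^*)\|^2 \ge \frac{M'(\hat\beta^L)}{4\phi_{\max}}\,r^2, \tag{7.2}$$

$$\|\hat\beta^L - \beta^*\| \ge \frac{\sqrt{M'(\hat\beta^L)}}{2\phi_{\max}}\,r, \tag{7.3}$$

where $\phi_{\max}$ is the maximum eigenvalue of the matrix $\Psi$. If, in addition, $M'(\beta^*) \le s'$, and

$$\min\{|\Psi_{jj}\beta^*_j| : j \in \mathbb{N}_K,\ \beta^*_j \ne 0\} > \Big(\frac{3}{2} + \frac{16 s'}{\kappa'^2}\max_{j\ne k}|\Psi_{jk}|\Big) r, \tag{7.4}$$

where $\Psi_{jk}$ denotes the $(j, k)$-th entry of the matrix $\Psi$, then with the same probability we have

$$M'(\hat\beta^L) \ge M'(\beta^*). \tag{7.5}$$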

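Proof. Inequality (B.3) in [4] yields (7.2) on the event $\mathcal{A} = \big\{\frac{1}{N}\|X^\top W\|_\infty \le \frac{r}{2}\big\}$ of probability

$$P(\mathcal{A}) \ge 1 - K^{1 - A^2/8}.$$

Next, (7.3) follows from (7.2) and the inequality

$$\frac{1}{N}(\hat\beta^L - \beta^*)^\top X^\top X(\hat\beta^L - \beta^*) \le \phi_{\max}\|\hat\beta^L - \beta^*\|^2.$$

We now prove (7.5). If $M'(\hat\beta^L) < M'(\beta^*)$ then there exists $j \in J'(\hat\beta^L)^c \cap J'(\beta^*)$. Set $\Delta = \beta^* - \hat\beta^L$ and recall that $\Psi = \frac{1}{N}X^\top X$. Using that any Lasso solution $\hat\beta^L$ satisfies

$$\Big(\frac{1}{N}X^\top(y - X\hat\beta^L)\Big)_j = \mathrm{sign}(\hat\beta^L_j)\,r \ \ \text{if}\ \hat\beta^L_j \ne 0, \qquad \Big|\Big(\frac{1}{N}X^\top(y - X\hat\beta^L)\Big)_j\Big| \le r \ \ \text{if}\ \hat\beta^L_j = 0, \tag{7.6}$$

and the triangle inequality we get, on the event $\mathcal{A}$, that $|(\Psi\Delta)_j| \le \frac{3r}{2}$. Consequently

$$|\Psi_{jj}\beta^*_j| = |\Psi_{jj}\Delta_j| \le |(\Psi\Delta)_j| + \sum_{k\ne j}|\Psi_{jk}||\Delta_k| \le \frac{3r}{2} + \|\Delta\|_1\max_{k\ne j}|\Psi_{jk}|. \tag{7.7}$$

Next, Corollary B.2 in [4] yields that, on the event $\mathcal{A}$,

$$\|\Delta_{|J'(\beta^*)^c}\|_1 \le 3\|\Delta_{|J'(\beta^*)}\|_1.$$

Thus, the Cauchy-Schwarz inequality, Assumption 7.1 and [4, Inequality (7.8)] give that, on the event $\mathcal{A}$,

$$\|\Delta\|_1 \le 4\|\Delta_{|J'(\beta^*)}\|_1 \le 4\sqrt{s'}\,\|\Delta_{|J'(\beta^*)}\| \le \frac{4\sqrt{s'}}{\kappa'}(\Delta^\top\Psi\Delta)^{1/2} \le \frac{16 s'}{\kappa'^2}\,r. \tag{7.8}$$

Combining (7.7) and (7.8) yields, on the event $\mathcal{A}$, that

$$|\Psi_{jj}\beta^*_j| \le \Big(\frac{3}{2} + \frac{16 s'}{\kappa'^2}\max_{k\ne j}|\Psi_{jk}|\Big) r,$$

which contradicts the condition (7.4).

Let us emphasize that Theorem 7.1 establishes lower bounds which hold for every Lasso solution if $\hat\beta^L$ is not unique.

Theorem 7.1 highlights several limitations of the usual Lasso as compared to the Group Lasso. Let us explain this point in the multi-task learning case. There, the usual Lasso estimator $\hat\beta^L$ is a solution of the following optimization problem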

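$$\min_{\beta}\ \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\|X_t\beta_t - y_t\|^2 + 2r\sum_{t=1}^{T}\sum_{j=1}^{M}|\beta_{tj}|.$$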

By comparing the prediction error lower bound in Theorem 7.1 for this estimator with the corresponding upper bound for the Group Lasso estimator derived in Corollary 4.1 we reach the following conclusions.

• The usual Lasso does not enjoy any dimension independence phenomenon as compared to the Group Lasso.

In the multi-task learning setting we have $N = nT$, $K = MT$. Assume that the tasks' design matrices are orthogonal, namely $X_t^\top X_t/n = I_{M\times M}$ for every $t \in \mathbb{N}_T$. Hence, $\Psi = I_{TM\times TM}/T$, so that $\phi_{\max} = \phi = 1/T$ and $\Psi_{jj} = 1/T$ for all $j$. Let a special instance of the group sparsity assumption be realized, namely, all vectors $\beta^*_t$ have exactly $s$ non-zero entries at the same positions. Then $M(\beta^*) = s$ and $M'(\beta^*) = sT$. Moreover, condition (7.4) simplifies to the requirement that

$$\min\{|\beta^*_j| : \beta^*_j \ne 0\} > \frac{3A\sigma}{2}\sqrt{\frac{\log(MT)}{n}}.$$

We conclude by inequalities (7.2) and (7.5) that, with probability at least $1 - (MT)^{1 - A^2/8}$,

$$\frac{1}{nT}\|X(\hat\beta^L - \beta^*)\|^2 \ge A^2\sigma^2\,\frac{s\log(MT)}{4n}. \tag{7.9}$$

This bound holds no matter what the number of tasks $T$ is. In contrast, the bounds in Corollary 4.1 can be made independent of the dimension $M$ and of the number of tasks $T$ as soon as $T \ge \log M$. Specifically, under the above assumptions we have, recalling Assumption 4.1, that $\kappa_{MT} \ge 1$ and by (4.3), with probability close to 1, every Group Lasso solution $\hat\beta$ satisfies

$$\frac{1}{nT}\|X(\hat\beta - \beta^*)\|^2 \le 128\sigma^2\,\frac{s}{n}\Big(1 + \frac{A\log M}{T}\Big). \tag{7.10}$$

• The Group Lasso achieves faster rates of convergence in some cases as compared to the usual Lasso. We consider separately two cases. The first one is already discussed in the preceding remark. It corresponds to $T \ge \log M$. Then the upper bound for the Group Lasso (7.10) is smaller than the lower bound (7.9) for the Lasso by a logarithmic factor. This factor can be large if $T$ is large, for example exponential in $n$, so that (7.9) gives no convergence result for the Lasso. The second case is $T < \log M$. Then the lower bound (7.9) is of the order $s(\log M)/n$, while the upper bound (7.10) is of the order $s(\log M)/(nT)$. The ratio is of the order $T$ in favor of the Group Lasso.

In (7.9) and (7.10) we have only compared the prediction errors of the two estimators. In view of inequality (4.6) and Theorem 7.1, similar observations are valid for the $\ell_2$ estimation errors.
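The two regimes can be compared numerically. The following sketch assumes the Lasso lower bound (7.9) has the form $A^2\sigma^2 s\log(MT)/(4n)$ and the Group Lasso upper bound (7.10) the form $128\sigma^2(s/n)(1 + A\log M/T)$, both in the orthogonal-design example; the function names are hypothetical, and the two constants $A$ need not coincide (the first must exceed $2\sqrt2$, the second $5/2$).

```python
import math

def lasso_lower_bound(sigma, n, T, M, s, A):
    """Assumed right-hand side of (7.9): A^2 sigma^2 s log(MT) / (4 n)."""
    return A ** 2 * sigma ** 2 * s * math.log(M * T) / (4.0 * n)

def group_lasso_upper_bound(sigma, n, T, M, s, A_gl):
    """Assumed right-hand side of (7.10) with kappa_MT >= 1."""
    return 128.0 * sigma ** 2 * (s / n) * (1.0 + A_gl * math.log(M) / T)

def bound_ratio(sigma, n, T, M, s, A, A_gl):
    """Lasso lower bound divided by Group Lasso upper bound; values above 1 favor the Group Lasso."""
    return lasso_lower_bound(sigma, n, T, M, s, A) / group_lasso_upper_bound(sigma, n, T, M, s, A_gl)

# The ratio grows with T: the Lasso bound carries log(MT), the Group Lasso bound does not.
for T in (10, 10 ** 3, 10 ** 6, 1e30):
    print(T, bound_ratio(sigma=1.0, n=100, T=T, M=100, s=5, A=3.0, A_gl=3.0))
```

Because of the constant 128, the ratio only exceeds 1 for very large $T$ in this example; the statement in the text concerns orders of magnitude, not the explicit constants.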
8 Non-Gaussian noise

In this section, we show that the above results extend to non-Gaussian noise. We consider here the multi-task setting described in Section 2.1 and we only assume that the components of the random vector $W$ are independent with zero mean and finite fourth moment $\mathbb{E}[W_{ti}^4]$. As we shall see, the results remain similar to those of the previous sections, though the concentration effect is weaker. We need the following technical assumption.

Assumption 8.1. The matrix X is such that 

-^^max \{xti)j\^ < x\ 



max ■ " ™ „„,,„,,-. ^ ™2 

i=l 



for a finite constant x*. 

This assumption is quite mild. It is satisfied, for example, if all $(x_{ti})_j$ are bounded in absolute value by a constant uniformly in $i, t, j$. We have the two following theorems.

Theorem 8.1. Consider the model (|2.1l) /or any M > 2, T,n > 1. Assume that the components 
of random vector W are independent with zero mean, maXigNrjeNM ^^[^4] — ^^' ^^^ diagonal 
elements of the matrix X^X/n are equal to 1 and M{(3*) < s. Let also Assumption \8J} be 
satisfied. Set 



nT V VT 



• 7 r r^ rr., . , , , .,• , 1 4v/log(2Af)[(81og{12M))2 + l]l/2 i ■ o i- 

With > 0. I hen with probability at least 1 Qo m)3/^+^ ' ■''''" ^"^ solution p of 

problem (12.61) we have 



If in addition, Assumption \4. 1 1 holds, then with the same probability for any solution (3 of problem 
(12.61 ) we have 



i|i;^_,.|i,,<i§:i^4fi,a2i^^V^ (S.3) 



T '^MT V '^ V yT 

M0) < ^., (8.4) 



K 



MT 



where 0mt i^ the largest eigenvalue of the matrix X^ X/n. If, in addition, nyi-^{2s) > 0, then with 
the same probability for any solution [3 of problem (12.61) we have 



0-n<*J^f.U^^^^'"y"'*'\''' 



T K^{2s) \ n \ ^/T 
Theorem 8.2. Consider the model (2.1) for $M \ge 2$, $T, n \ge 1$. Let the assumptions of Theorem 8.1 be satisfied and let Assumption 5.1 hold with the same $s$. Set

$$c = \frac{3}{2} + \frac{16}{7(\alpha - 1)}.$$



Let X be as in Theorem \8.1\ Then with probability at least 1 — no m)^/^+^ ' f^^ ^"^ 

solution (3 of problem d2.(5D we have 



If, in addition, it holds that 

mm —=\\[p y\\ > —= 1 + 



ieJ{/3*) ^/T Vn\ Vt 

then with the same probability for any solution (i of problem A2.6\l the set of indices 



^^^^^7f"^'«>7sl^ 



estimates correctly the sparsity pattern J{/3*): 

J=J(/3*). 



Proof. The proofs of these theorems are similar to those of Theorems 3.1 and 5.1 up to a modification of the bound on $P(\mathcal{A}^c)$ in Lemma 3.1. We consider now the event



A I M 

A= { max 



T / n 



\ t=i \i=i 



Define the random variables 

>^t. = E(^*^W*0 -El(^*^)il'^[^*']' J = 1,---,M, t = l,...,T. 



.1=1 / i=l 



We have 



< P [ max ^Ytj > xy nVT (log Mf/^+^ 



t=i 



< 



x262nVT(logM)3/2+5 
Applying the maximal moment inequality of Lemma 9.1 below with $m = 1$ and constant $c(1) = 2$ we obtain



E max 
i<i<M 



E^. 



t=i 



< ^/8 log(2M) E 



Emax Y,, 
l<j<M ■' 
t=l 



1/2N 



(8.5) 



< VS log(2M) 



EH. 



max Y, 



<j<M 
T 



tj 



1/2 



< A^log{2M) lb'^xtn^T + ^E\ max 



t=i 



<J<A/ 



Xl(^*')j-^*' 



i=l 



1/2 



By the maximal moment inequality of Lemma 9.1 with $m = 4$ and constant $c(4) = 12$ (since $M \ge 2$) the last expectation is bounded, for any $t = 1, \ldots, T$, as

2^ 



E I max 

i<i<A'/ 



^ixu)jWk 



i=l 



< (81og(12M))2E 



Emax {xu)'^M^i 



i=l 



l<i<A'/ 



Setting for brevity Xj = maxi<j<M(a;ii)^ we have 



E 



Emax (xti)^W; 



1=1 



l<j<M 






\i^k i=l 



j=i 



Combining the above four displays yields 



P(^'^) < 



4^\og{2M) (81og(12M))2 + l 



1/2 



(logM)3/2+'5 



9 Maximal moment inequality 

In this section we prove the following inequality for the $m$-th moment of maxima of sums of independent random variables.

Lemma 9.1. (Maximal moment inequality) Let $Z_1, \ldots, Z_n$ be independent random vectors in $\mathbb{R}^M$, and let $Z_{i,j}$ denote the $j$-th component of $Z_i$. Then for any $m \ge 1$ and $M \ge 1$ we have

$$\mathbb{E}\Big(\max_{1\le j\le M}\Big|\sum_{i=1}^{n}(Z_{i,j} - \mathbb{E}Z_{i,j})\Big|^m\Big) \le \big[8\log(c(m)M)\big]^{m/2}\,\mathbb{E}\Big(\Big[\max_{1\le j\le M}\sum_{i=1}^{n}Z_{i,j}^2\Big]^{m/2}\Big),$$

where $c(m) = \min\{c > 0 : e^{m-1} - 1 \le (c - 2)M\}$. In particular, $2 \le c(m) \le e^{m-1} + 1$.

Before giving the proof, we make some comments. The case $m = 2$ of Lemma 9.1 implies, modulo constants, Nemirovski's inequality (see iTT], page 188, and [fT3l, Corollary 2.4). In general, Nemirovski's inequality concerns the second moment of $\ell_p$-norms ($1 \le p \le \infty$) of sums of independent random variables in $\mathbb{R}^M$, whereas we only consider $p = \infty$. On the other hand, even for $m = 2$ Lemma 9.1 is more general than what is given by Nemirovski's inequality because we interchange the maximum and the sum on the right hand side. The case $M = 1$ of Lemma 9.1 yields the Marcinkiewicz-Zygmund inequality (see [29], page 82), and as an immediate consequence the inequality

$$\mathbb{E}\Big|\sum_{j=1}^{n}\xi_j\Big|^m \le \big[8\log(c(m))\big]^{m/2}\,n^{m/2 - 1}\sum_{j=1}^{n}\mathbb{E}|\xi_j|^m, \qquad m \ge 2, \tag{9.1}$$

for independent zero-mean random variables $\xi_j$. Thus, as a particular instance, we give a short proof of (9.1) and provide the explicit constant. This constant is of the optimal order in $m$ but larger than the one obtainable from the recent sharp moment inequality due to Rio [33].
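The constant $c(m)$ has a closed form, which makes the values used earlier ($c(1) = 2$, and $c(4) \le 12$ when $M \ge 2$) easy to verify. The sketch below assumes the definition $c(m) = \min\{c > 0 : e^{m-1} - 1 \le (c-2)M\}$, whose minimizer is $c(m) = 2 + (e^{m-1} - 1)/M$; the function names are hypothetical.

```python
import math

def c_const(m, M):
    """Smallest c with e^(m-1) - 1 <= (c - 2) * M, i.e. c = 2 + (e^(m-1) - 1) / M."""
    return 2.0 + (math.exp(m - 1) - 1.0) / M

def maximal_moment_factor(m, M):
    """Leading factor [8 log(c(m) M)]^(m/2) in the maximal moment inequality."""
    return (8.0 * math.log(c_const(m, M) * M)) ** (m / 2.0)

print(c_const(1, 10))              # equals 2 for every M
print(c_const(4, 2))               # about 11.54, hence the rounded constant 12 for M >= 2
print(maximal_moment_factor(2, 1000))
```

Since $c(m)$ decreases in $M$ and equals $e^{m-1} + 1$ at $M = 1$, the bound $2 \le c(m) \le e^{m-1} + 1$ follows directly from the closed form.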



Proof. Let $(\varepsilon_1, \ldots, \varepsilon_n)$ be a sequence of i.i.d. Rademacher random variables independent of $Z = (Z_1, \ldots, Z_n)$. Let $\mathbb{E}_Z$ denote conditional expectation given $Z$. By Hoeffding's inequality, for all $L > 0$ and all $i$ and $j$,

$$\mathbb{E}_Z\exp[Z_{i,j}\varepsilon_i/L] \le \exp[Z_{i,j}^2/(2L^2)]. \tag{9.2}$$

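Define

$$\zeta = \max_{1\le j\le M}\left|\sum_{i=1}^n Z_{i,j}\epsilon_i\right|.$$

Using successively Jensen's inequality (the function $x \mapsto \log^m(x + e^{m-1} - 1)$ is concave for $x \ge 1$), the inequality $e^{|x|} \le e^{x} + e^{-x}$, $\forall x \in \mathbb{R}$, the independence of the $\epsilon_i$, and (9.2), we obtain

$$\begin{aligned}
E_Z(\zeta^m) &\le L^m E_Z \log^m\left\{\exp[\zeta/L] + e^{m-1} - 1\right\} \\
&\le L^m \log^m\left\{E_Z\exp[\zeta/L] + e^{m-1} - 1\right\} \\
&\le L^m \log^m\left\{\sum_{j=1}^M E_Z\exp\left[\left|\sum_{i=1}^n Z_{i,j}\epsilon_i\right| / L\right] + e^{m-1} - 1\right\} \\
&\le L^m \log^m\left\{2M\exp\left[\max_{1\le j\le M}\sum_{i=1}^n Z_{i,j}^2/(2L^2)\right] + e^{m-1} - 1\right\}.
\end{aligned}$$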



Note that $2Mx + e^{m-1} - 1 \le c(m)Mx$ for all $x \ge 1$, where $c(m)$ is the constant defined in the
statement of the lemma. This and the previous display yield

$$E_Z(\zeta^m) \le L^m \log^m\left\{c(m)M\exp\left[\max_{1\le j\le M}\sum_{i=1}^n Z_{i,j}^2/(2L^2)\right]\right\} = L^m\left\{\log(c(m)M) + \frac{\max_{1\le j\le M}\sum_{i=1}^n Z_{i,j}^2}{2L^2}\right\}^m.$$
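The elementary inequality $2Mx + e^{m-1} - 1 \le c(m)Mx$ used in this step follows directly from the definition of $c(m)$. A short verification, written out as a sketch:

```latex
% By definition of c(m):  e^{m-1} - 1 <= (c(m) - 2) M.
% Since c(m) >= 2 and x >= 1, the right-hand side can only grow
% when multiplied by x:
\[
  e^{m-1} - 1 \;\le\; (c(m) - 2)M \;\le\; (c(m) - 2)Mx ,
  \qquad x \ge 1 .
\]
% Adding 2Mx to both sides gives the claimed bound:
\[
  2Mx + e^{m-1} - 1 \;\le\; c(m)Mx .
\]
```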
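Choosing

$$L = \sqrt{\frac{\max_{1\le j\le M}\sum_{i=1}^n Z_{i,j}^2}{2\log(c(m)M)}}$$

gives

$$E_Z\left(\max_{1\le j\le M}\left|\sum_{i=1}^n Z_{i,j}\epsilon_i\right|^m\right) \le \left[2\log(c(m)M)\right]^{m/2}\left[\max_{1\le j\le M}\sum_{i=1}^n Z_{i,j}^2\right]^{m/2}.$$

Hence,

$$E\left(\max_{1\le j\le M}\left|\sum_{i=1}^n Z_{i,j}\epsilon_i\right|^m\right) \le \left[2\log(c(m)M)\right]^{m/2} E\left[\max_{1\le j\le M}\sum_{i=1}^n Z_{i,j}^2\right]^{m/2}.$$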
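The effect of this choice of $L$, and the origin of the constant $8$ in the lemma, can be checked with a few lines of algebra. In the sketch below, $S$ and $\ell$ are shorthand introduced only for this check and are not notation from the paper.

```latex
% Shorthand (introduced here for illustration only):
%   S    = max_{1<=j<=M} sum_{i=1}^n Z_{i,j}^2,
%   \ell = log(c(m) M),   so that   L^2 = S / (2 \ell).
\[
  L^m\Bigl\{\ell + \frac{S}{2L^2}\Bigr\}^m
  = \Bigl(\frac{S}{2\ell}\Bigr)^{m/2} (2\ell)^m
  = (2\ell)^{m/2}\, S^{m/2}.
\]
% De-symmetrization costs a factor 2 at the level of the m-th root,
% i.e. a factor 2^m on the m-th moment, and
\[
  2^m (2\ell)^{m/2} = (8\ell)^{m/2}
  = \bigl[8\log(c(m)M)\bigr]^{m/2}.
\]
```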



Finally, we de-symmetrize (see Lemma 2.3.1 page 108 in BTII ) 



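$$\left(E\max_{1\le j\le M}\left|\sum_{i=1}^n\left(Z_{i,j} - EZ_{i,j}\right)\right|^m\right)^{1/m} \le 2\left(E\max_{1\le j\le M}\left|\sum_{i=1}^n Z_{i,j}\epsilon_i\right|^m\right)^{1/m}.$$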
A Auxiliary results

Here we collect some auxiliary results which we have used in the paper.

The first result is taken from [9, Eq. (27)] and was used in the proof of Lemma 3.1.

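Lemma A.1. Let $\xi_1, \dots, \xi_N$ be i.i.d. $\mathcal{N}(0, 1)$, $v = (v_1, \dots, v_N) \ne 0$,

$$\eta_v = \frac{1}{\sqrt{2}\,\|v\|}\sum_{i=1}^N (\xi_i^2 - 1)v_i \qquad \text{and} \qquad m(v) = \frac{\|v\|_\infty}{\|v\|}.$$

We have, for all $x > 0$, that

$$P(|\eta_v| > x) \le 2\exp\left(-\frac{x^2}{2\left(1 + \sqrt{2}\,x\,m(v)\right)}\right).$$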
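To get a feel for the size of this tail bound, the following minimal sketch evaluates it numerically. It assumes the bound reads $2\exp\bigl(-x^2/(2(1+\sqrt{2}\,x\,m(v)))\bigr)$ with $m(v) = \|v\|_\infty/\|v\|$ and $\|\cdot\|$ the Euclidean norm; the function names `m_of` and `tail_bound` are illustrative and not from the paper.

```python
import math
from typing import Sequence


def m_of(v: Sequence[float]) -> float:
    """Ratio of the sup-norm to the Euclidean norm of a nonzero vector v."""
    norm2 = math.sqrt(sum(x * x for x in v))
    if norm2 == 0.0:
        raise ValueError("v must be nonzero")
    return max(abs(x) for x in v) / norm2


def tail_bound(v: Sequence[float], x: float) -> float:
    """Assumed form of the bound: 2 exp(-x^2 / (2 (1 + sqrt(2) x m(v))))."""
    if x <= 0.0:
        raise ValueError("the bound is stated for x > 0")
    return 2.0 * math.exp(-x * x / (2.0 * (1.0 + math.sqrt(2.0) * x * m_of(v))))


if __name__ == "__main__":
    # Equal weights: m(v) = N^{-1/2}, so the Gaussian-like regime lasts longer
    # as N grows.
    for N in (4, 100, 10_000):
        v = [1.0] * N
        print(N, m_of(v), tail_bound(v, 3.0))
```

For small $x\,m(v)$ the exponent behaves like $-x^2/2$, and for large $x\,m(v)$ it is linear in $x$, which is the usual two-regime shape of a tail bound for weighted sums of centered chi-square variables.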



The next lemma provides the link between Assumptions 5.1 and 3.1 and was used extensively
in our analysis in Section 5.

Lemma A.2. Let Assumption 5.1 be satisfied. Then Assumption 3.1 is satisfied with $\kappa = \sqrt{1 - 1/\alpha}$.
Proof. We use here the notation introduced in the proof of Theorem 5.1. For any subset $J$ of $\mathbb{N}_M$
such that $|J| \le s$ and any $\Delta \in \mathbb{R}^K$ we have



\A}{^-<I>Ik.k)Aj\< EEE 

jj'eJ t=i t'=i 

nim{Kj,Kji 



nj,j'] 



nj,f] 



t,t' 






lA-^IIA-^ 



t,t 



|a;iiaj 






nj,j'[ 



t,t' 






I A^ 1 1 A^ 



We now treat separately the first and second terms on the right-hand side of the above display. For
the first term we have, using consecutively Assumption 5.1, Cauchy-Schwarz and Minkowski's
inequality for the Euclidean norm in R^^ , that 



K, 



EE 



nj,f] 



t,t 



I At 1 1 A* I < 



< 



'^min'f' 
14aAmaxS 

14aAmaxS 



K, 



E Ei^^i 

t=i VjeJ 

||A "2 



J||2,1 



, ^min0 11 . 11 2 

14aAmax 

For the second term we get, using Assumption 5.1 and the Cauchy-Schwarz inequality twice, that

2 



EE E 



^b,/] 



t£ 



lA-^'llA-' I < 
It I It' I — 



< 



Amin0 
14Q;AmH 



e4-Eia;i 



max- \ .^j Y Kj ^^^ 



A,f. 



Combining the two above displays yields 



|A,||2 



+ 



AJ(^-0/;,x;^)A. 



|A,/| 



> 



1- 



Z\r 



lAaXr 



We proceed similarly to treat the quantity $|\Delta_{J^c}^\top \Psi \Delta_J|$. We have, using Assumption 5.1 and the Cauchy-Schwarz and Minkowski inequalities, that



j&J^J'^J t=l 



nj,j'] 



t,t 



\M\\M\ 



Kj Kf 



jeJ^j'eJ t=i t'=i,ti^t 



ni^f] 



t,t' 



lA-^'llA^' I 



, Amin0 II » II II , II 



14Q;Arr,axS 



+ 






14aAmax5 \ ^ f-! JK^ 
jeJ t=l V J 



., K 

j(zjc t=l V ^^3 



\^t\ 



< 



2A, 



14Q;Arr,a,xS 



l^jIballAjclU 1. 



Next we have, for any vector $\Delta \in \mathbb{R}^K$ satisfying the inequality $\sum_{j\in J^c}\lambda_j\|\Delta^j\| \le 3\sum_{j\in J}\lambda_j\|\Delta^j\|$, that



lA 



J'\\2,l 



< 



< 



E ii^'ii 

y ^iiA^i 



3 



E^.ii^^' 



inm .^ J 

oAjjiax 11 A II 
^ — IK^J 2,1- 



Ami 



min 



Combining these inequalities we find that 



A^^A A}^Aj 2A}.^Aj 



|A,||2 



|A,|P 



|A.,||2 



> 



2A„,in0 120||Aj|||i 
14as||ArF 



> 1 



14aAr, 



a 



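Lemma A.3. Let $Ts \ge 8$. If $\omega$ and $\omega'$ are two elements of $\Omega$ such that $\rho(\omega, \omega') \ge \frac{Ts}{8}$, then the
cardinality of the set $J(\omega, \omega') = \left\{j \le s : \sum_{t=1}^T I\{\omega_{tj} \ne \omega'_{tj}\} \ge \frac{T}{16}\right\}$ is greater than or equal to $\frac{s}{16}$.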
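Proof. Assume that $|J(\omega, \omega')| < s/16$. Then, denoting by $J(\omega, \omega')^c$ the complement of $J(\omega, \omega')$,
and using that $|J(\omega, \omega')^c| \le s$, we get

$$\rho(\omega, \omega') = \sum_{j\in J(\omega,\omega')}\sum_{t=1}^T I\{\omega_{tj} \ne \omega'_{tj}\} + \sum_{j\in J(\omega,\omega')^c}\sum_{t=1}^T I\{\omega_{tj} \ne \omega'_{tj}\} < \frac{s}{16}\,T + s\,\frac{T}{16} = \frac{Ts}{8},$$

which contradicts the premise of the lemma. ■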